This article was written by Adrian Colyer.
Pure Gold! Here we have twelve wonderful lessons in how to avoid expensive mistakes in companies that are trying their best to be data-driven. A huge thank you to the team from Microsoft for sharing their hard-won experiences with us.
In our experience of running thousands of experiments with many teams across Microsoft, we observed again and again how incorrect interpretations of metric movements may lead to wrong conclusions about the experiment’s outcome, which if deployed could hurt the business by millions of dollars.
Before we dive into the details, there are a couple of interesting data points about experimentation at Microsoft I wanted to draw your attention to:
- Microsoft have an experimentation system used across Bing, MSN, Cortana, Skype, Office, Xbox, Edge, Visual Studio, and so on, that runs thousands of experiments a year.
- This isn’t just fiddling on the margins – “At Microsoft, it is not uncommon to see experiments that impact annual revenue by millions of dollars, sometimes tens of millions of dollars.”
If we’re going to talk about metric interpretation then first we’d better have some metrics. Section 4 in the paper has some excellent guidelines concerning the types of metrics you should be collecting. You’re going to need a collection of metrics such as these in order to diagnose and avoid the pitfalls.
Let’s talk about metrics.
Most teams at Microsoft compute hundreds of metrics to analyze the results of an experiment. While only a handful of these metrics are used to make a ship decision, we need the rest of the metrics to make sure we are making the correct decision.
There are four primary classes of metrics that Microsoft collects:
- Data Quality Metrics. These help to determine whether an experiment was correctly configured and run, such that the results can be trusted. An important example is the ratio of the number of users in the treatment to the number of users in the control. You need this metric to avoid the first pitfall…
- Overall Evaluation Criteria (OEC) Metrics. The leading metrics that determine whether the experiment was successful. These are metrics that can be measured during the short duration of an experiment and are also indicative of long-term business value and user satisfaction. In the ideal case, a product has just one OEC metric.
- Guardrail Metrics. “In addition to OEC metrics, we have found that there is a set of metrics which are not clearly indicative of success of the feature being tested, but which we do not want to significantly harm when making a ship decision.” These are the guardrail metrics – on MSN and Bing for example, page load time is a guardrail metric.
- Local Feature and Diagnostic Metrics. These measure the usage and functionality of individual features of a product. For example, click-through rates for individual elements on a web page. These metrics often help to diagnose and explain where OEC movement (or lack of it) is coming from. They can also reveal situations where improvements in one area harm other areas.
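To make the data quality example concrete, here is a minimal sketch of how the treatment-to-control user ratio might be checked. It uses a chi-squared goodness-of-fit test with one degree of freedom, which is one common way to do this and not necessarily what Microsoft's system does. The function names, the 50/50 default split, and the 0.001 alert threshold are all assumptions made for illustration.

```python
import math


def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """P-value of a chi-squared goodness-of-fit test (1 degree of freedom)
    comparing observed user counts with the configured traffic split.

    expected_treatment_share is an assumed parameter: the fraction of users
    the experiment was configured to send to treatment.
    """
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (treatment_users - expected_treatment) ** 2 / expected_treatment
        + (control_users - expected_control) ** 2 / expected_control
    )
    # For a chi-squared variable with 1 degree of freedom,
    # P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def has_sample_ratio_mismatch(control_users, treatment_users,
                              expected_treatment_share=0.5, alpha=0.001):
    """Flag the experiment when the observed split is very unlikely under
    the configured split. alpha=0.001 is an illustrative threshold."""
    p_value = srm_p_value(control_users, treatment_users, expected_treatment_share)
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts: a near-even split versus a visibly skewed one.
    print(has_sample_ratio_mismatch(50_000, 50_100))
    print(has_sample_ratio_mismatch(50_000, 48_000))
```

A deliberately strict threshold is used here because a check like this would run on every experiment, and a flagged mismatch usually means the remaining metrics should not be trusted until the cause is found.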
Now the pitfalls:
- Pitfall #1: Metric sample ratio mismatch
- Pitfall #2: Misinterpretation of ratio metrics
- Pitfall #3: Telemetry loss bias
- Pitfall #4: Assuming underpowered metrics had no change
- Pitfall #5: Claiming success with a borderline p-value
- Pitfall #6: Continuous monitoring and early stopping
- Pitfall #7: Assuming the metric movement is homogeneous
- Pitfall #8: Segment (mis)interpretation
- Pitfall #9: Impact of outliers
- Pitfall #10: Novelty and primacy effects
- Pitfall #11: Incomplete funnel metrics
- Pitfall #12: Failure to apply Twyman’s Law