Being fast is more important than being wrong
Fundamentally, running an A/B test is a risk mitigation strategy. An A/B test can substantially reduce the risk that the team invests a lot of time and energy building something that users don't need, or that has a neutral or even negative impact on key business metrics like revenue.
Often, when we talk about risk management for A/B tests, we are concerned about making the "wrong" decision: either we choose to roll out a feature we shouldn't have (a "type 1" error, or "false positive"), or we choose not to roll out a feature we should have (a "type 2" error, or "false negative"). We use statistical methods to reduce the risk of making the wrong decision because of these errors. These methods don't eliminate the errors; they only limit how often they occur.
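To make these error rates concrete, here is a minimal simulation sketch (my own illustration, not from any of the sources cited here): it runs many A/A comparisons, where both variants come from the same distribution, and counts how often a t-test at a significance level of 0.05 still reports a "significant" difference. The sample size, distribution, and number of simulations are arbitrary assumptions chosen for illustration.

```python
# A/A simulation sketch: with no real difference between variants, any
# "significant" result is a type 1 error, and the rate of such results
# should land close to the chosen significance level alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05            # type 1 error budget (significance level)
n_per_variant = 1_000   # hypothetical sample size per variant
n_simulations = 2_000
false_positives = 0

for _ in range(n_simulations):
    # Both "variants" are drawn from the same distribution on purpose.
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_variant)
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_per_variant)
    _, p_value = stats.ttest_ind(control, treatment)
    if p_value < alpha:
        false_positives += 1

print(f"Observed false positive rate: {false_positives / n_simulations:.3f} (alpha = {alpha})")
```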
My understanding is that most of the default values used for confidence levels and statistical power were chosen in the context of high-stakes research such as experiments for testing new vaccines or cancer treatments. Digital product management is clearly a very different environment. For the vast majority of digital products, making the "wrong" decision results in some lost revenue for the business, but nobody dies!
Often-quoted studies have shown that, in organizations that tracked it, only 10-30% of the ideas they A/B tested actually produced a statistically significant positive impact on their outcome metric.[1][2] By trying to meet the same quality standard as medical studies, one could argue that we are setting the bar too high for lower-stakes experiments on digital products. If the alternative is not A/B testing at all, we could assume we would make the correct decision roughly 30% of the time. Our A/B testing program therefore just has to ensure we are correct more often than that. Even allowing some wiggle room to account for the cost of A/B testing, clearing an accuracy threshold of 40% is much easier than clearing one of 95%. Ideally we want to get as close to 100% accuracy as is feasible; however, the closer we try to get to that goal, the longer it takes to run experiments. The relationship is non-linear, so going from 30% to 40% accuracy has a much smaller impact on velocity than going from 95% to 99%. A business that aimed for 99% accuracy would release far fewer changes than one that was comfortable being "wrong" 50% of the time.
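To get a feel for how non-linear that trade-off is, here is a rough sketch using the standard two-sample sample-size formula, n ≈ 2(z₁₋α/₂ + z₁₋β)² / d², for a standardized effect size d. The effect size, power, and confidence levels below are my own illustrative assumptions, not figures from the cited studies.

```python
# Sketch of required sample size per variant as the confidence level increases,
# holding power at 80% and assuming a small standardized effect size.
from scipy.stats import norm

effect_size = 0.05   # assumed standardized effect size (small, typical for web metrics)
power = 0.80         # 1 - beta (type 2 error budget of 20%)
z_beta = norm.ppf(power)

for confidence in (0.70, 0.80, 0.90, 0.95, 0.99):
    alpha = 1 - confidence
    z_alpha = norm.ppf(1 - alpha / 2)
    n_per_variant = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    print(f"confidence = {confidence:.0%} -> n per variant ≈ {n_per_variant:,.0f}")
```

Under these assumptions, moving from 90% to 99% confidence roughly doubles the required traffic per variant, which translates directly into longer-running experiments.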
This same argument can be made when considering interaction effects, proxy metrics, novelty effects, and other quality issues that A/B tests can face. We don't need to be 100% accurate; we just need to accrue as much value as quickly as possible.
Being fast is more important than being wrong.
[1] In 2009, Ronny Kohavi stated that Microsoft saw only ~30% of its experiments produce statistically significant positive results. Kohavi, Ronny, et al. "Online experimentation at Microsoft." Data Mining Case Studies 11.2009 (2009): 39.

[2] "At Booking.com, only about 10% of experiments generate positive results." "Building a Culture of Experimentation," Harvard Business Review, March-April 2020. Retrieved 2023-04-01.