Confidence intervals

Confidence intervals are notoriously difficult to understand and interpret correctly. This is an attempt to provide a simplified explanation that allows decision makers who don't have a deep background in statistics to effectively use confidence intervals in their work.

Why use confidence intervals?

Growth teams often run experiments in an attempt to predict the future. For example, we run a painted door test on a subset of users (e.g. users who visit the site over a two-week period) and then attempt to estimate the click-through-rate we might observe if we were to actually build the feature and roll it out to all users in perpetuity.

A common mistake made by growth teams is to look at the observed click-through-rate in the painted door test and assume that they will observe this same click-through-rate if they roll the feature out to all users.

Instead, the observed click-through-rate should be thought of as an estimate of the real click-through-rate. However, it's possible that due to chance this particular painted door test produced a result that is actually quite different from the click-through-rate you would see if you rolled it out. This single estimate (also called the point estimate) is useful, but it is somewhat risky to assume that you would see this exact same effect on rollout.

In contrast, a confidence interval gives you a range of values that helps you quantify the uncertainty around your estimate. If the interval is narrow, then we have a fairly precise estimate. If the interval is wide, then there is still a lot of uncertainty.
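
To make this concrete, here is a minimal sketch of how such an interval might be computed for an observed click-through-rate using the common normal-approximation (Wald) formula. The click and impression counts are made up for illustration, and real analyses often prefer more robust methods (e.g. the Wilson interval).

```python
import math

from scipy.stats import norm


def ctr_confidence_interval(clicks, impressions, confidence=0.95):
    """Normal-approximation (Wald) interval for an observed click-through-rate."""
    p_hat = clicks / impressions                # the point estimate
    z = norm.ppf(1 - (1 - confidence) / 2)      # ~1.96 for a 95% interval
    margin = z * math.sqrt(p_hat * (1 - p_hat) / impressions)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


# Hypothetical painted door result: 120 clicks out of 1,000 impressions
low, high = ctr_confidence_interval(120, 1_000)
print(f"point estimate: {120 / 1_000:.1%}, 95% interval: [{low:.1%}, {high:.1%}]")
```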

A 95% confidence interval tells us that if we were to re-run this same experiment an infinite number of times, the confidence intervals we would generate would contain the real click-through-rate ~95% of the time.

How to think about the confidence interval

You can think of the confidence interval as a range of values (in this case: click-through-rates) that we are reasonably confident contains the real value you would observe if you rolled this change out to all users.

A common misinterpretation

A 95% confidence interval does not mean that there is a 95% chance that the true value lies inside the specific interval our experiment produced.

This is a subtle difference that confused me at first. Other sources dive deeply into this, but for us mortals, it's a fine heuristic to just think of the interval as a range of values that we are reasonably confident contains the true value.

It's also worth noting that confidence intervals are generated using frequentist methods. Bayesian methods produce credible intervals, which are also useful, but different: a 95% credible interval can be read as "there is a 95% probability that the true value lies in this range," given the chosen prior.
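
To make the distinction concrete, here is a hypothetical side-by-side for the same made-up data: a frequentist confidence interval next to a Bayesian credible interval computed from a uniform Beta(1, 1) prior. The counts and the prior are assumptions chosen purely for illustration.

```python
from scipy.stats import beta, norm

clicks, impressions = 120, 1_000          # made-up painted door results

# Frequentist 95% confidence interval (normal approximation)
p_hat = clicks / impressions
margin = norm.ppf(0.975) * (p_hat * (1 - p_hat) / impressions) ** 0.5
print(f"confidence interval: [{p_hat - margin:.1%}, {p_hat + margin:.1%}]")

# Bayesian 95% credible interval with a uniform Beta(1, 1) prior
posterior = beta(1 + clicks, 1 + impressions - clicks)
print(f"credible interval:   [{posterior.ppf(0.025):.1%}, {posterior.ppf(0.975):.1%}]")
```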

Deepening our understanding

As we've seen, confidence intervals are difficult to think about intuitively (and correctly), so let's dive into an example to see if that helps.

Let's imagine that we run an e-commerce shop dedicated to selling jeans. We'll call our shop "Jean Therapy". We think a chatbot will help increase our sales. However, it will take some time to find a good chatbot and integrate it into our system. To quickly understand how many users would even use a chatbot if we offered one, we decide to run a painted door test.

Our experiment design is simple: we show a small button in the bottom right-hand corner of the screen that advertises a chat experience. If a user clicks the button, we record their click and display a message saying that this feature isn't available right now.

If we can estimate the fraction of the overall user population that clicks on the chat widget, we will have a much better understanding of how much our users actually want this feature and how much it could conceivably impact our sales numbers.

This is far-fetched, but stick with me for a moment. Let's imagine that we run this experiment 100 times in rapid succession instead of just once. We record the confidence interval generated by analyzing each of these experiments. Afterwards, we roll out the chat widget to all of our traffic and observe the actual click-through-rate over a long period of time. The real click-through-rate measured after rollout is 10%. We then go back and look at all of our recorded experiment results to see how good they were at estimating the true click-through-rate.
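
Here is a rough sketch of that thought experiment in code, assuming 1,000 visitors per experiment and normal-approximation intervals; the interactive simulation below may use different parameters and interval math.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

TRUE_CTR = 0.10         # click-through-rate observed after the full rollout
SAMPLE_SIZE = 1_000     # visitors per painted door experiment (assumed)
CONFIDENCE = 0.95
N_EXPERIMENTS = 100

z = norm.ppf(1 - (1 - CONFIDENCE) / 2)
contained = 0
for _ in range(N_EXPERIMENTS):
    clicks = rng.binomial(SAMPLE_SIZE, TRUE_CTR)    # simulate one experiment
    p_hat = clicks / SAMPLE_SIZE
    margin = z * np.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    if p_hat - margin <= TRUE_CTR <= p_hat + margin:
        contained += 1

print(f"{contained} of {N_EXPERIMENTS} intervals contained the true CTR of {TRUE_CTR:.0%}")
```

With a 95% confidence level you should typically see roughly 95 of the 100 intervals contain the true click-through-rate, though the exact count varies from run to run.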

You can tinker with the parameters here to see how they impact the confidence intervals that are generated by each experiment.

[Interactive simulation: set the sample size for each experiment and the confidence level, then select Run Simulation.]

Given these experimental parameters, the simulation shows how many of the 100 experiments generated confidence intervals that contained the true click-through-rate of 10%.

Now, in reality, you would probably run only one of these experiments, so any one of these generated confidence intervals is a plausible output of your experiment.

Observations you might make after tinkering
  1. Smaller sample sizes result in wider confidence intervals. Narrower confidence intervals are more useful to you since they give you a more precise estimate of the true click-through-rate (see the sketch after this list).
  2. Lower confidence levels generate narrower confidence intervals, but those intervals contain the true click-through-rate less frequently.
  3. Even with a high confidence level and a large sample size, it's possible your experiment will produce a confidence interval that doesn't contain the true click-through-rate.
  4. You shouldn't assume that the confidence interval you generate in a single experiment is the "correct" one. There is a fair amount of variability in the generated intervals; however, a large fraction of them do contain the true click-through-rate.
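
As a quick illustration of the first two observations, this sketch prints the width of a normal-approximation interval around a 10% click-through-rate for a few assumed sample sizes and confidence levels.

```python
from scipy.stats import norm

p_hat = 0.10    # an observed click-through-rate of 10%

# Interval width grows with the confidence level and shrinks with the sample size
for confidence in (0.80, 0.95, 0.99):
    z = norm.ppf(1 - (1 - confidence) / 2)
    for n in (100, 1_000, 10_000):
        width = 2 * z * (p_hat * (1 - p_hat) / n) ** 0.5
        print(f"confidence={confidence:.0%}  n={n:>6}  interval width={width:.3f}")
```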