Frequently Used Statistics In Product Experiments

After an overview of probability & statistics for data science (excluding ML), I’m going to dive a bit deeper into a few areas often used in product analytics experimentation: power analysis, statistical significance testing, and multiple hypothesis adjustments, plus some general guidelines.

Power Analysis

Quite often in AB testing, we want to write up a shareable experiment doc, namely to lay out the product change, the experiment hypothesis, the primary success metric, etc (template). Part of this work is determining beforehand how long the experiment should run (i.e. to avoid peeking and early stopping – since we’ve determined the run length in advance). The basis of a power analysis is something along the lines of: ‘assuming there is a true effect, how do we make sure we see it?’ To give an estimate, we need to determine the sample sizes required for each bucket (if we know the sample size, the daily counts, and the cure time for the primary metric, we can translate this into a ‘time to run’ estimate for the test). To determine sample sizes, we need three key things:

1. Significance level. Typically we use an alpha of 0.05 (95% confidence), i.e. we are okay with a 5% chance of a Type I error – where we see an effect that was actually due to chance (an alarm without a fire). The lower the confidence level (i.e. the larger the alpha you accept), the less sample you’ll need.

2. Power level (usually 0.80), which is the probability we will correctly reject the null hypothesis for a given true effect (statquest) – i.e. avoid a Type II error (a fire without an alarm). The lower the power, the less sample you’ll need.

The above two are usually kept at the standard values (95% confidence, 80% power), unless you are willing to accept more risk and lower the thresholds – for example because you need to move quickly and cannot afford to gather more sample – or you are more risk averse and raise them (e.g. to 99% confidence). A more flexible area is the expected lift.

3. Expected lift from the baseline conversion, AKA the Minimum Detectable Effect (MDE). This is the smallest effect you care about detecting (e.g. a standardized difference between means for continuous variables, or a conversion rate difference for binomial outcomes); more precisely, it is the smallest effect size for which the power is >= our desired power level. Along with power, the MDE determines how ‘sensitive’ the experiment is. If you want to be able to detect a +/-1% relative lift in a conversion rate, you’ll need a much larger sample size than if you only need to detect a +/-10% relative lift. Similarly, if the true effect is 5% but you used a 10% MDE in your calculations, you’re unlikely to see any impact even though there is one, because you didn’t gather enough sample to see it! Remember to clarify whether this is a relative or absolute percentage.

Often I will give stakeholders a few options or a table here if needed: for example, maybe you need to run an experiment for 15 weeks with a 5% MDE, but only 3 weeks with a 10% MDE. If the business only has 4 weeks to make a decision, 10% might be okay (e.g. the goal is just to avoid a disastrous change) – whereas if the business needs to be able to detect effects as small as 5% (e.g. due to the cost/benefit of a vendor decision) and time is not an issue, we should use 5%. Granted, MDE calculations are sometimes negotiable/flexible and require looking at historical conversion rates (or whatever success metric is chosen) and estimating what the baseline will be. The ideal MDE is one based on a very similar past experiment (e.g. we ran this for the iOS app and there was a 3% lift; now we are running it for Android, so 3% is not a bad estimate for the MDE).

Note that for t-tests, one would replace the baseline conversion with two additional quantities: the historical sample mean and the historical sample standard deviation for the control group (statsmodels function).
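To make this concrete, here is a minimal sketch of a sample-size calculation in Python using statsmodels, with assumed illustrative inputs (a 10% baseline conversion rate, a 5% relative MDE, 95% confidence, 80% power – the same numbers used in the tables further down):

```python
# Minimal a priori power-analysis sketch for a conversion (binomial) metric.
# All inputs below are assumed for illustration.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.10                           # historical conversion rate (control)
mde_relative = 0.05                       # smallest relative lift we care to detect
expected = baseline * (1 + mde_relative)

effect_size = proportion_effectsize(expected, baseline)  # Cohen's h
n_per_bucket = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,             # significance level (95% confidence)
    power=0.80,
    ratio=1.0,              # equal split between control and treatment
    alternative="two-sided",
)
print(round(n_per_bucket))  # roughly 57,000-58,000 per variation
```

For a continuous metric, statsmodels’ tt_ind_solve_power takes a standardized effect size (the difference in means divided by the historical standard deviation) instead of a baseline conversion rate.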

Now, one thing not included in the above three components is the type of test (superiority, non-inferiority, equivalence), i.e. what the null hypothesis is. Most common is the superiority test, where the null hypothesis is that the treatment performs the same as or worse than the control, and the alternative hypothesis is that the treatment is better than / ‘superior’ to control (i.e. we want to avoid implementing a solution that is not better than what we have in control). Meanwhile, non-inferiority tests aim to make sure that the new experience is not worse than control. Here the null hypothesis is that the new version is inferior to the existing version by at least a predefined margin – similar in size to the MDE (and the alternative hypothesis is that it is not). Equal, or even a little worse, is not necessarily ‘inferior’, depending on the margin chosen for a non-inferiority test. Lastly, equivalence tests have a null hypothesis that the groups differ by at least a predefined margin, and an alternative hypothesis that they do not (i.e. they are practically equivalent). In each case the union of the null and alternative hypotheses must cover the set of all possible values the treatment effect might take.
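As a compact summary (writing $\mu_T$ and $\mu_C$ for the treatment and control values of the metric, and $\delta$ for the predefined margin – notation introduced here just for illustration):

$$
\begin{aligned}
\text{Superiority:} \quad & H_0: \mu_T \le \mu_C \quad \text{vs.} \quad H_1: \mu_T > \mu_C \\
\text{Non-inferiority:} \quad & H_0: \mu_T \le \mu_C - \delta \quad \text{vs.} \quad H_1: \mu_T > \mu_C - \delta \\
\text{Equivalence:} \quad & H_0: |\mu_T - \mu_C| \ge \delta \quad \text{vs.} \quad H_1: |\mu_T - \mu_C| < \delta
\end{aligned}
$$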

Another consideration is whether this is a one-tailed or two-tailed test. A two-tailed test looks for an effect regardless of direction, so it requires a larger sample size than a one-tailed test, where directionality matters (e.g. if you are only looking for a lift or nothing). A general rule of thumb: a two-tailed test is more cautious and good for research to see if something moved up or down; a one-tailed superiority test is good if you only want to know whether there was a lift for something basic like a button change; and a one-tailed non-inferiority test is good for a ‘do no harm’ or required change, to make sure nothing too negative happened. Most online calculators are effectively two-tailed ‘non-equivalence’ tests, which have a null hypothesis that there is no effect and an alternative hypothesis that there is one.

Beyond the significance level, power level, and MDE (see example calculator), and the type of test, other considerations include: deciding which primary success metric to use (obviously a faster-curing metric takes less time); the ratio between the groups (usually 50-50, since the smaller bucket determines run time, but sometimes we do 90-10 ratios to reduce risk and only show treatment to 10%); one- vs two-tailed tests (e.g. do you only care about one direction for the difference?); and certain Bayesian methods where you make some assumptions about the distributions before the experiment starts to reduce variance and the need for more sample (such as a multi-armed bandit, MAB) – which are effectively self-stopping. Another thing you can do to reduce the time to run with the same sample-size requirement is to broaden the sample thrown into the experiment (e.g. maybe we combine US and CA customers to increase the population and make the decision for both together).

In practice there is another concern: data loss. Maybe you think you will get 1,000 new website visits a day, and you need 5,000 in each bucket – so you will run your experiment for 10 days to gather the full sample ((5,000 x 2) / 1,000). But on day one you get 900 in your sample (but it’s a Sunday, so maybe it’s just a slow day), and on Monday you get 850. What’s going on? Well, it could be bucketing issues such as a data privacy change (e.g. people opting out of cookie tracking). Or hell, maybe the count is higher than you expected (e.g. from people clearing their cookies and re-entering the experiment). This could also be bucketing in the wrong place – typically you can bucket before an experience or when the experience loads (and obviously not after the changed experience). Bucketing before a new experience risks a sample ratio mismatch, but if the experience only applies to a downstream subsample, early bucketing lets you sanity-check effects by comparing against the bucketed-but-unchanged group. Regardless, make sure you are bucketing at the right customer level! This all might not be part of the power analysis formula, but it can affect the sample size gathered. In general, padding the sample a bit never hurts in practice. One way we’ll often do this is to always run our experiments in full weeks, so if we need 12 days, we’ll round to 14 to make sure what we gather is not affected by day of the week.
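As a quick sketch of this back-of-the-envelope run-length math (the helper name and numbers are hypothetical, matching the example above):

```python
# Translate a required sample per bucket into a run-length estimate,
# padding up to full weeks to smooth out day-of-week effects.
import math

def run_length_days(sample_per_bucket: int, n_buckets: int, daily_traffic: int,
                    pad_to_full_weeks: bool = True) -> int:
    days = math.ceil(sample_per_bucket * n_buckets / daily_traffic)
    if pad_to_full_weeks:
        days = math.ceil(days / 7) * 7
    return days

print(run_length_days(5_000, 2, 1_000))  # 10 days, padded to 14 (two full weeks)
```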

In addition to data loss, a good rollout plan with clear checks can help reduce issues when translating an experiment into practice. If on day 1 we roll the experiment out to 100% of the sample (e.g. 50% treatment, 50% control) and the treatment breaks the experience, you may catch it faster than if you only rolled the experiment out to 10%, but you also give more people a broken experience. Running an A-A test to start off an experiment is also a good best practice to debug issues (i.e. get sample size, and maybe even do a quick check on some data distributions to confirm randomness; you should expect no effect here given you’re comparing two identical experiences). Netflix also often does a slow rollout to treatment after an experiment is deemed successful as another double check on the experience (blog).

Ultimately you don’t know how the experiment will turn out, and this is the best guess for how long it will take (i.e. how much sample it will cost) to find out whether there is an effect. This is effectively a formula that needs balancing somehow: you don’t get good learnings from experiments for free. Statistical power is fundamentally a measure of how much information you gathered in your study. Running an experiment on a small population will take a long time, even with variance reduction techniques to speed it up, and that’s a reality that needs to be communicated to stakeholders, along with the limits on what can be learned.

Okay, so how much can we fudge this? Let’s look at an illustrative example with a 10% baseline rate (below) using a binomial two-tailed non-equivalence a priori power analysis to compute the estimated sample size. 

| Significance Level | Power | Relative MDE | Sample per variation |
|---|---|---|---|
| 95% | 80% | 5% | 57,000 |
| 95% | 75% | 5% | 50,000 |
| 90% | 80% | 5% | 45,000 |
| 95% | 80% | 10% | 14,300 |
| 90% | 60% | 10% | 6,500 |

(calculator)

Changing the power and significance level has a smaller impact than changing the MDE (there are diminishing returns there). If we assume 1,000 samples per day per bucket and the 10% baseline conversion rate, then with some assumption changes you can run the experiment with roughly a tenth of the sample – but you do not get that without a loss of information.

Walking through this hypothetical scenario: 2K samples per day, split 50-50 into 1K per bucket per day, assuming no data loss, a 10% baseline conversion rate, and a same-day curing metric. At 95% confidence and 80% power for a two-tailed test, we could run this for about 14 days with a 10% MDE (the threshold to detect), or 63 days (rounding 57 days up to a full week) with a 5% MDE.
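Tying this back to the run-length helper sketched earlier (same assumed traffic):

```python
# 57,000 per bucket, 2 buckets, 2,000 total samples per day -> 57 days, padded to 63.
print(run_length_days(57_000, 2, 2_000))  # 63
```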

Now, let’s take a look at three different scenarios: do no harm, lift only, or total impact:

| Significance – Power – MDE | Tails | Test Type | Margin | Purpose | Sample per variation |
|---|---|---|---|---|---|
| 95% – 80% – 5% | Two | Non-Equivalence | N/A | Total Impact | 57,000 |
| 95% – 80% – 5% | One | Superiority | N/A | Lift Only | 45,000 |
| 95% – 80% – 5% | One | Non-Inferiority | 2.5% | Do No Harm | 20,000 |
| 95% – 80% – 5% | One | Non-Inferiority | 5% | Do No Harm | 11,000 |
For the non-inferiority tests we have to include a margin; here I used 5% (relative) to match the MDE, plus a stricter 2.5% margin for comparison.

Notice how the one-tailed superiority test’s estimated sample size (45K) is smaller than that of the two-tailed test with the same significance, power, and MDE. This is because the two-tailed test is looking above and below the distribution (article) – it’s really looking at the 2.5% tail on each end, whereas the one-tailed superiority test only looks at 5% on one end. To get the same sample size requirement (45K) from a two-tailed test, we can see in the previous table that 90% confidence achieves this (since it likewise puts 5% in each tail). We also see the non-inferiority tests have much smaller sample size requirements (granted, we are supplying a margin) – which makes sense given we are only checking whether the change is worse than a certain margin, rather than different from 0. While we can reduce the required sample with non-inferiority tests, we lose some of what we were testing for with the superiority test.
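Sticking with the same assumed inputs, here is a sketch of how the choice of tails changes the estimate in statsmodels (the non-inferiority rows additionally depend on the margin and on what true effect the calculator assumes, so they are not reproduced here):

```python
# One-tailed vs two-tailed sample-size estimates for the same effect size.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

h = proportion_effectsize(0.105, 0.10)   # 10% baseline, 5% relative MDE
analysis = NormalIndPower()

two_tailed = analysis.solve_power(effect_size=h, alpha=0.05, power=0.80,
                                  alternative="two-sided")  # ~57-58K per bucket
one_tailed = analysis.solve_power(effect_size=h, alpha=0.05, power=0.80,
                                  alternative="larger")     # ~45-46K per bucket
print(round(two_tailed), round(one_tailed))
```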

Statistical Significance Testing

Okay, so let’s say we completed our power analysis above, shared our experiment doc with stakeholders, and even ran A-A test checks for a few days to verify that everything looks good. We start the experiment. I like to pull the sample per bucket by date after a week or so to verify it’s randomized properly, i.e. there’s no sample ratio mismatch (SRM) and the sample is split 50-50. We also want to check some other distributions to make sure the two samples are comparable (e.g. a few key characteristics like cohort/start date, to make sure this is an apples-to-apples and truly randomized comparison). We also want to make sure there is an even distribution of OTHER AB tests going on, so we’re not accidentally measuring effects from two experiments at once due to overlap issues. Also, no early peeking! Let’s assume this is all fine, the test is completed, the data is cured, and we want to see if there is an effect.
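A quick way to formalize the SRM check is a chi-square goodness-of-fit test against the intended split (the counts below are hypothetical):

```python
# Sample ratio mismatch (SRM) check for an intended 50-50 split.
from scipy.stats import chisquare

observed = [50_420, 49_610]                 # hypothetical bucket counts
expected = [sum(observed) / 2] * 2          # what a perfect 50-50 split would give
stat, p_value = chisquare(observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")  # a very small p-value suggests broken bucketing
```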

If this was a test on continuous variables (e.g. revenue), we would use a t-test to compare means, maybe with some variance reduction techniques like outlier handling (removing the top 1-5% of outliers, winsorizing, taking log values, etc) to get a fuller picture. If the test had a binomial outcome such as conversion, we would use a Z-test or Chi-Square test (calculator video) – comparison (really, it doesn’t matter much which one you use here). I have run a lot of tests using conversion as a success metric, so I typically use a simple Chi-Square formula in Google Sheets to show a summary. ANOVA can also be useful to look at mean differences across multiple groups. Non-technical stakeholders often find confidence intervals more digestible than p-values (nice article).
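As a rough sketch of the tests named above (all numbers and simulated data are made up for illustration):

```python
# Two-sample tests for a continuous metric and a binomial (conversion) metric.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Continuous metric (e.g. revenue): Welch's t-test on simulated skewed data.
rng = np.random.default_rng(42)
control = rng.lognormal(mean=3.00, sigma=1.0, size=5_000)
treatment = rng.lognormal(mean=3.02, sigma=1.0, size=5_000)
t_stat, t_p = stats.ttest_ind(treatment, control, equal_var=False)

# Binomial metric (e.g. conversion): two-proportion z-test ...
conversions = np.array([1_120, 1_048])   # converters in treatment, control
samples = np.array([10_000, 10_000])
z_stat, z_p = proportions_ztest(conversions, samples)

# ... or the (roughly equivalent) chi-square test on the 2x2 contingency table.
table = np.column_stack([conversions, samples - conversions])
chi2, chi_p, dof, _ = stats.chi2_contingency(table)

print(t_p, z_p, chi_p)
```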

Multiple Hypothesis Adjustment

In my probability and stats overview, I touched briefly on p-hacking, but one sneaky area is adjusting for multiple hypotheses. For example, say we ran an experiment that included Mexico, Canada, and US customers – with the primary success metric being conversion to a revenue event. It’s natural that we’ll want to look at conversion broken out by country, but doing so increases the odds of seeing a ‘significant’ effect (a p-value <= 0.05 at that threshold) somewhere purely by chance. In fact, if we split this into 20+ subgroups, we would expect about one p-value < 0.05 just by chance (the probability of at least one is 1 − 0.95^20 ≈ 64%)! So, to adjust for this, we’ll use a correction (article).

The most commonly used method is the Bonferroni correction, where you divide the significance threshold by the number of hypotheses (equivalently, multiply each p-value by it), but I’ll often use Holm’s adjustment since Bonferroni can lead to underpowered tests. Say we had the following p-values: 0.021, 0.025, 0.30 (for CA, MX, US respectively). If we DIDN’T make a correction for multiple hypotheses, we would say CA and MX were stat sig, and the US was not. If we used the Bonferroni correction, we would divide our threshold of 0.05 by 3 and use 0.017 instead, and say none of the by-country effects were significant, though Canada (and Mexico) were pretty close. With Holm’s adjustment, we sort the p-values smallest to largest and use a moving threshold of 0.05 divided by (the number of hypotheses minus the rank plus 1), stopping at the first p-value that misses its threshold. It sounds complicated, but in practice what this means is that the threshold for the lowest p-value, 0.021, is 0.05 / (3 − 1 + 1) = 0.017; the second-lowest is compared to 0.05 / (3 − 2 + 1) = 0.025; and the third to 0.05 / (3 − 3 + 1) = 0.05. Here the smallest p-value (0.021) already misses its 0.017 threshold, so the procedure stops and nothing is declared significant – the same conclusion as Bonferroni in this particular example, though Holm’s later thresholds are less strict, so in general it will flag at least as much as Bonferroni. This is tricky: no correction is almost certainly the wrong thing to do here, Bonferroni can be too harsh, and I’ve found Holm’s adjustment to be a good middle road.
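Running the example p-values through statsmodels’ multiple-testing helper (the method names are real statsmodels options; the p-values are the ones from the example above):

```python
# Bonferroni vs Holm on the by-country p-values from the example.
from statsmodels.stats.multitest import multipletests

p_values = [0.021, 0.025, 0.30]   # CA, MX, US
for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, reject, p_adj.round(3))
# With these particular p-values neither method rejects anything (CA and MX are close),
# but Holm's step-down thresholds are never stricter than Bonferroni's, so in general
# it rejects at least as many hypotheses.
```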

Good article on AB testing with multiple metrics.

General Guidelines

A single study doesn’t tell you much about what the world is like, it has little value at all unless you know the rest of the literature and have a sense about how to integrate these findings with previous ones.

Calling Bullshit by Carl T. Bergstrom and Jevin D. West

Any analysis can always get more complex and we can always be more rigorous: be careful not to procrastinate on recommending a decision by adding needless complexity when something ballpark will do to move forward. On the other hand, the value of experimentation (or the scientific method, for that matter) is to test our hypotheses / gut instincts and find out when we were wrong (or, in rarer cases, right): don’t waste an opportunity to add new knowledge to an organization, as it does not come for free. Deciding how much rigor to add to an experiment past a certain basic level is more of an art than a science – it depends on how much needs to be known and how much time you have to learn it (e.g. check metrics might change a decision in some cases, and just be interesting in others). The ‘art’ part depends on clarity around strategy and often on how much pressure or pushback might surface from decisions that result from the experiment. You might want a bit more rigor around an experiment that tests a new company strategy to be presented to executives than around a blue vs red button. Sometimes in product experimentation, stakeholders see AB testing as just gating a launch or a last-step risk assessment, when really, in its full form, we are conducting actual research on customers, which can sometimes lead to profound insights. If you are the first data scientist in an area, maybe you start with less rigor to build trust with your stakeholders and show you want to add value fast and support the team’s goals, or maybe you start off with a lot of rigor to emphasize that there is power in data to affect decisions and that we want to be sure we get the full insights.

Ultimately there are a few key goals with every experiment and a few key things to avoid. The things we want to avoid are: really bad decisions (i.e. launching something that’s really bad, or not launching something really good), poor expectation setting (i.e. procrastinating on hard decisions due to struggling with the uncertainty of neutral results, etc), and misconstruing reality (i.e. taking away the wrong learnings, or disproportionate confidence). This is all, of course, in addition to avoiding general bad experimentation practices like peeking, poor setup, data loss, etc. The things you want to do well are: prepare for every outcome beforehand (i.e. explain how things might be positive, negative, or neutral with some degree of confidence), be consistent (i.e. if another DS conducted this experiment, they would come up with similar, even if not identical, results), and make sure the rigor and focus match the goal (i.e. if the goal is to avoid disasters / ‘do no harm’, make sure you do that first and the rest is bonus; if the goal is to give guidance and inform strategy, make sure you do that as honestly and as well as you can). Experimentation is really about beliefs and intellectual honesty. Statistics adds rigor, gives height to goals and weight to outcomes, and over time can result in better decision-making and less wasted time for everyone.

Shout out to Kasia Rachuta & Nicholas Topousis for the review / feedback!