Most A/B tests are underpowered (and the second-most-common mistake is stopping too early)

How does an A/B test that 'shows a 4 percent improvement' end up with revenue flat three months later? Almost always, the test was underpowered: the 4 percent was inside the noise band, statistical power wasn't computed before launch, the duration wasn't fixed in advance, and someone stopped the test on the day the variant looked good.

This is the most common pattern in production A/B tests. It's also why the literature is full of strongly worded papers arguing that most reported A/B results don't survive replication.

There's a useful framing. An A/B test has three properties you have to fix in advance: the metric, the minimum detectable effect, and the test duration. Any A/B test that fixes fewer than three of these is doing post-hoc analysis dressed as experimentation, and it's why most teams' average effect across years of A/B tests is suspiciously close to zero. The wins they thought they shipped weren't real wins.

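Minimum detectable effect: how big does the difference between A and B have to be for you to care? 1 percent? 5 percent? Say whether you mean a relative lift or absolute percentage points, because a 5 percent relative lift on a 5 percent baseline is only 0.25 percentage points. The number depends on how much the change cost to build, what the baseline metric is, and what the smallest meaningful business impact would be. You set this number first, in writing, before the test runs.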

Sample size: given the MDE and the baseline conversion rate, statistical power tables tell you how many users you need to detect that effect at, say, 80 percent power and 5 percent significance. There are calculators for this. Use them. A 1 percentage-point MDE on a 5 percent baseline at 80 percent power needs roughly 8,000 users per arm. Smaller MDEs scale steeply, roughly with the inverse square of the effect: 0.5 percentage points pushes the requirement to around 31,000 per arm, and 0.3 percentage points to around 85,000. Most teams don't have that kind of traffic in two weeks.
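The same arithmetic can be run backwards to ask how much power a test has with the traffic actually available. The sketch below is an approximation under stated assumptions: it inverts the normal-approximation formula for a two-sided two-proportion z-test, ignores the negligible opposite tail, and uses the standard library's `NormalDist` for the normal distribution. The function name `approx_power` is illustrative, not from any library.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(baseline_rate: float, mde: float, n_per_arm: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test with n_per_arm users per arm.

    mde is an absolute difference (0.01 = 1 percentage point).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    p1, p2 = baseline_rate, baseline_rate + mde
    p_bar = (p1 + p2) / 2
    sd_null = sqrt(2 * p_bar * (1 - p_bar))          # spread assumed under "no difference"
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))     # spread if the true lift equals the MDE
    z = (abs(mde) * sqrt(n_per_arm) - z_alpha * sd_null) / sd_alt
    return std_normal.cdf(z)


# 5% baseline, 1 percentage-point MDE
full = approx_power(0.05, 0.01, 8158)     # close to the 80% target
starved = approx_power(0.05, 0.01, 2000)  # well under half
print(f"{full:.2f} {starved:.2f}")
```

With a quarter of the required sample, the test would miss a real 1-point lift most of the time, which is what "underpowered" means in practice.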

Test duration: divide the total sample size (the per-arm size times the number of arms) by the daily traffic entering the test, and you have a duration. If it's 21 days, the test runs for 21 days. Not 12 days because the variant looked good on day 12. Not 7 days because someone needed the slide for a board meeting.
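The duration arithmetic is short enough to keep next to the sample-size calculation. This is a minimal sketch: `duration_days` is a hypothetical helper, and rounding up to whole weeks is an assumption added here (so each weekday is covered equally), not something the duration rule itself requires.

```python
from math import ceil


def duration_days(n_per_arm: int, daily_users: int, arms: int = 2, round_to_weeks: bool = True) -> int:
    """Days needed to collect n_per_arm users in every arm.

    daily_users is the total traffic entering the experiment per day, across all arms.
    """
    if daily_users <= 0:
        raise ValueError("daily_users must be positive")
    days = ceil(n_per_arm * arms / daily_users)
    if round_to_weeks:
        # Assumption: pad to full weeks so weekday/weekend mix is balanced.
        days = ceil(days / 7) * 7
    return days


print(duration_days(8158, 1200))    # 1 percentage-point MDE at ~1,200 users/day
print(duration_days(31233, 1200))   # 0.5 percentage-point MDE at the same traffic
```

Whatever number comes out is the number that gets written down before launch.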

The hardest discipline is the pre-commitment. Write down the metric, the MDE, and the duration before the test starts, and don't change them mid-flight. The variant being 'directionally positive' on day 4 isn't a reason to ship; it's noise. The variant being 'flat for three weeks' isn't a reason to keep waiting; it's a real signal that the change didn't move the metric meaningfully at the scale you have.

Three other failure modes

Three other failure modes worth flagging.

The novelty effect: users behave differently for a week or two after a UI change because it's new, not because it's better. Either discard the first week of any meaningful UI test or run the test for at least four weeks.

The peeking problem: every time you check the test mid-flight and decide whether to stop, you increase the false-positive rate. The fix is to pre-commit to a stopping rule: a sequential design built for interim looks (a group-sequential test with alpha spending, or a Bayesian sequential procedure with a decision threshold set in advance), or just: don't peek.

The segmentation trap: if 17 segments don't show a significant effect but the 18th does, that 18th wasn't a discovery, it was a coin flip. With 18 independent segments each tested at 5 percent significance, the chance that at least one comes up significant by luck alone is about 60 percent. Pre-register the segments you'll analyse.
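The peeking problem is easy to see in simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive by construction. It compares declaring a winner on any day the z-test crosses the threshold with testing once at the planned end. All parameter values are illustrative assumptions, and the pooled two-proportion z-test is used for simplicity.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of n users each."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    return 0.0 if se == 0 else ((conv_b - conv_a) / n) / se


def simulate_aa(rate=0.10, users_per_arm_per_day=100, days=14,
                runs=1000, alpha=0.05, seed=0):
    """Return (false-positive rate with daily peeking, rate with one final test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peek_hits = fixed_hits = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        ever_significant = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _u in range(users_per_arm_per_day))
            conv_b += sum(rng.random() < rate for _u in range(users_per_arm_per_day))
            n += users_per_arm_per_day
            if abs(z_stat(conv_a, conv_b, n)) > z_crit:
                ever_significant = True  # a peeker would have called a winner today
        peek_hits += ever_significant
        fixed_hits += abs(z_stat(conv_a, conv_b, n)) > z_crit
    return peek_hits / runs, fixed_hits / runs


peeking_fpr, fixed_fpr = simulate_aa()
print(f"peeking daily: {peeking_fpr:.3f}, single final test: {fixed_fpr:.3f}")
```

The single final test stays near the nominal 5 percent, while the any-day rule lands well above it, even though nothing differs between the arms.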

One discipline worth keeping: at the start of every test, write down what you expect to happen and why, and what you'll do in each outcome: variant wins big, variant wins marginally, variant flat, variant loses. Writing out the less flattering scenarios surfaces what you actually believe and what would actually change as a result. Most variants that 'win marginally' don't ship if the team has already committed to what marginal wins are worth.
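One way to make that write-up hard to quietly revise is to store it as an immutable record alongside the experiment code. The sketch below is hypothetical throughout: `ExperimentPlan`, `classify_outcome`, the metric name, and the four outcome labels are illustrative, and the mapping from a confidence interval to an outcome is one reasonable convention under stated assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ExperimentPlan:
    metric: str
    mde: float                  # absolute difference, e.g. 0.01 = 1 percentage point
    duration_days: int
    expectation: str            # what you expect to happen, and why
    actions: Mapping[str, str]  # pre-committed action per outcome


def classify_outcome(plan: ExperimentPlan, ci_low: float, ci_high: float) -> str:
    """Map the confidence interval on (variant - control) to a pre-registered outcome.

    Assumed convention: "wins_big" needs the whole interval at or above the MDE.
    """
    if ci_low >= plan.mde:
        return "wins_big"
    if ci_low > 0:
        return "wins_marginally"
    if ci_high < 0:
        return "loses"
    return "flat"


plan = ExperimentPlan(
    metric="checkout_conversion",
    mde=0.01,
    duration_days=21,
    expectation="Shorter form lifts conversion by reducing drop-off at step 2.",
    actions={
        "wins_big": "ship",
        "wins_marginally": "do not ship; gain is below maintenance cost",
        "flat": "do not ship; archive the variant",
        "loses": "do not ship; write up why the hypothesis was wrong",
    },
)

outcome = classify_outcome(plan, ci_low=0.002, ci_high=0.012)
print(outcome, "->", plan.actions[outcome])
```

The decision for a marginal win was made before any data came in, so the day-21 conversation is about which branch applies, not about what the branch should say.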

A/B tests are tools for not fooling yourself, but only if the discipline is in place before the test runs. Without the discipline, they're elaborate machines for confirming whatever you wanted to ship.

// The artefact
# experiments/sample_size.py: compute the duration before you launch the test
from math import ceil, sqrt
from scipy.stats import norm  # norm.ppf is the inverse of the standard normal CDF

def sample_size_per_arm(baseline_rate: float, mde: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test.

    mde is absolute: 0.01 means 1 percentage point, not a 1% relative lift.
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)
    p1, p2 = baseline_rate, baseline_rate + mde
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
          z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / mde ** 2)
    return ceil(n)

# Example: 5% baseline, 1 percentage-point MDE → ~8,150 per arm. Two arms over 14 days
# means ~1,200 users/day entering the test in total (about 580 per arm per day). Halve
# the MDE and the number climbs to ~31,000 per arm; if you can't hit that, the test is decoration.

Sample size before launch. If the math says 6 weeks and you only have 2, the test result will be noise.