Published on: October 22, 2025
A/B testing is the gold standard for causal inference in product and marketing analytics. Yet most teams run it wrong — peeking at results early, under-sizing samples, or calling statistical significance a business win without checking effect size. This guide covers how to do it right, from framing the hypothesis to making the final call.
A good hypothesis has three components: a change, a metric, and a direction. For example:
"Showing a progress bar during checkout will increase the checkout completion rate by at least 3% relative to the current baseline."
Vague hypotheses like "the new design will perform better" don't tell you what to measure, what threshold defines success, or when to stop. Be precise upfront — it prevents HARKing (Hypothesising After the Results are Known) later.
The two biggest mistakes in A/B testing are running tests too short and stopping them the moment results look good. Both are solved by calculating sample size before the test begins. You need four inputs: the baseline conversion rate, the minimum detectable effect (MDE), the significance level (alpha), and the statistical power.
from statsmodels.stats.power import zt_ind_solve_power
import numpy as np
baseline = 0.12 # 12% conversion rate
mde = 0.015 # Detect a 1.5pp lift or more
alpha = 0.05
power = 0.80
# Normal-approximation effect size: the absolute lift scaled by the
# baseline standard deviation
effect_size = mde / np.sqrt(baseline * (1 - baseline))
n_per_variant = zt_ind_solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided'
)
print(f"Required sample per variant: {int(np.ceil(n_per_variant))}")
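With the required sample size in hand, the natural next step is to convert it into a test duration before launch. The traffic figures below are illustrative assumptions, not numbers from this article:

```python
import numpy as np

n_per_variant = 3700      # assumed result of the power calculation above
daily_visitors = 1000     # assumed eligible visitors per day
traffic_split = 0.5       # 50/50 split between control and variant

# Each arm receives half the daily traffic; the test must run until
# both arms reach the required sample size.
days_needed = int(np.ceil(n_per_variant / (daily_visitors * traffic_split)))
print(f"Minimum test duration: {days_needed} days")
```

Committing to that duration upfront is what makes a "no early stopping" rule enforceable in practice.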
A few non-negotiables once the test is live: no peeking at interim results, no stopping early because the numbers look good, and no changes to the variants or the randomisation mid-test. Run until you hit the pre-computed sample size.
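Why the no-peeking rule matters can be shown with a small simulation, a sketch assuming an A/A setup (two identical 12% variants, so there is no true effect to find): checking for significance at several interim points and stopping at the first p < 0.05 inflates the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_test_with_peeking(n_total=10000, n_peeks=10, p=0.12):
    """Simulate an A/A test (no real effect) with periodic peeking."""
    a = rng.binomial(1, p, n_total)
    b = rng.binomial(1, p, n_total)
    checkpoints = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)
    for n in checkpoints:
        # Test the two proportions at each peek via a chi-square test
        table = [[a[:n].sum(), n - a[:n].sum()],
                 [b[:n].sum(), n - b[:n].sum()]]
        _, pval, _, _ = stats.chi2_contingency(table)
        if pval < 0.05:
            return True  # declared "significant" — a false positive
    return False

false_positives = sum(run_test_with_peeking() for _ in range(500))
print(f"False-positive rate with peeking: {false_positives / 500:.1%}")
```

With ten peeks per test, the realised false-positive rate lands far above the 5% the test was supposed to guarantee.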
A p-value below 0.05 tells you the result is unlikely under the null hypothesis. It does not tell you the effect is practically meaningful. Always report the effect size (the absolute and relative lift) alongside the p-value, and judge it against the threshold your hypothesis set:
from statsmodels.stats.proportion import proportions_ztest
# Variant results
ctrl_conv, ctrl_n = 1180, 10000 # 11.8% conversion
test_conv, test_n = 1340, 10000 # 13.4% conversion
p_ctrl = ctrl_conv / ctrl_n
p_test = test_conv / test_n
# Two-proportion z-test
count = [test_conv, ctrl_conv]
nobs = [test_n, ctrl_n]
z, p = proportions_ztest(count, nobs)
lift = (p_test - p_ctrl) / p_ctrl * 100
print(f"Relative lift: {lift:.1f}%")
print(f"p-value: {p:.4f}")
print(f"Significant: {p < 0.05}")
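For the practical-significance call, a confidence interval on the difference is more informative than the p-value alone. A sketch using statsmodels with the same counts as above:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

ctrl_conv, ctrl_n = 1180, 10000
test_conv, test_n = 1340, 10000

# 95% CI for the absolute difference in conversion rates (test - control)
low, upp = confint_proportions_2indep(test_conv, test_n,
                                      ctrl_conv, ctrl_n,
                                      compare='diff')
print(f"Absolute lift: {(test_conv / test_n - ctrl_conv / ctrl_n) * 100:.2f} pp")
print(f"95% CI: [{low * 100:.2f} pp, {upp * 100:.2f} pp]")
```

If the whole interval clears the minimum detectable effect you committed to before launch, the result is both statistically and practically significant; if the interval straddles that threshold, the honest answer is "promising but inconclusive".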
A/B testing done well is one of the most powerful tools in an analyst's toolkit. The discipline it imposes — pre-registered hypotheses, power calculations, clean randomisation — forces rigorous thinking that makes every product decision more defensible. The teams that consistently do it right build a compounding informational advantage over those that just chase p-values.