Published on: October 22, 2025
A/B testing is the gold standard for causal inference in product and marketing analytics. Yet most teams run it wrong — peeking at results early, under-sizing samples, or calling statistical significance a business win without checking effect size. This guide covers how to do it right, from framing the hypothesis to making the final call.
A good hypothesis has three components: a change, a metric, and a direction. For example:
"Showing a progress bar during checkout will increase the checkout completion rate by at least 3% relative to the current baseline."
Vague hypotheses like "the new design will perform better" don't tell you what to measure, what threshold defines success, or when to stop. Be precise upfront — it prevents HARKing (Hypothesising After the Results are Known) later.
The two biggest mistakes in A/B testing are running tests too short and stopping them the moment results look good. Both are solved by calculating sample size before the test begins. You need four inputs: the baseline conversion rate, the minimum detectable effect (MDE), the significance level (alpha), and the statistical power.
from statsmodels.stats.power import zt_ind_solve_power
import numpy as np
baseline = 0.12 # 12% conversion rate
mde = 0.015 # Detect a 1.5pp lift or more
alpha = 0.05
power = 0.80
# Normal-approximation effect size: the absolute lift scaled by the
# baseline standard deviation
effect_size = mde / np.sqrt(baseline * (1 - baseline))
n_per_variant = zt_ind_solve_power(
effect_size=effect_size,
alpha=alpha,
power=power,
alternative='two-sided'
)
print(f"Required sample per variant: {int(np.ceil(n_per_variant))}")
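With the required sample size in hand, the natural next step is to convert it into a test duration before launch. The traffic figures below are illustrative assumptions, not numbers from this article:

```python
import numpy as np

n_per_variant = 3700      # assumed result of the power calculation above
daily_visitors = 1000     # assumed eligible visitors per day
traffic_split = 0.5       # 50/50 split between control and variant

# Each arm receives half the daily traffic; the test must run until
# both arms reach the required sample size.
days_needed = int(np.ceil(n_per_variant / (daily_visitors * traffic_split)))
print(f"Minimum test duration: {days_needed} days")
```

Committing to that duration upfront is what makes a "no early stopping" rule enforceable in practice.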
A few non-negotiables once the test is live: no peeking at interim results, no stopping early because the numbers look good, and no changes to the variants or the randomisation mid-test. Run until you hit the pre-computed sample size.
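Why the no-peeking rule matters can be shown with a small simulation, a sketch assuming an A/A setup (two identical 12% variants, so there is no true effect to find): checking for significance at several interim points and stopping at the first p < 0.05 inflates the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_test_with_peeking(n_total=10000, n_peeks=10, p=0.12):
    """Simulate an A/A test (no real effect) with periodic peeking."""
    a = rng.binomial(1, p, n_total)
    b = rng.binomial(1, p, n_total)
    checkpoints = np.linspace(n_total // n_peeks, n_total, n_peeks, dtype=int)
    for n in checkpoints:
        # Test the two proportions at each peek via a chi-square test
        table = [[a[:n].sum(), n - a[:n].sum()],
                 [b[:n].sum(), n - b[:n].sum()]]
        _, pval, _, _ = stats.chi2_contingency(table)
        if pval < 0.05:
            return True  # declared "significant" — a false positive
    return False

false_positives = sum(run_test_with_peeking() for _ in range(500))
print(f"False-positive rate with peeking: {false_positives / 500:.1%}")
```

With ten peeks per test, the realised false-positive rate lands far above the 5% the test was supposed to guarantee.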
A p-value below 0.05 tells you the result is unlikely under the null hypothesis. It does not tell you the effect is practically meaningful. Always report the effect size (the absolute and relative lift) alongside the p-value, and judge it against the threshold your hypothesis set:
from statsmodels.stats.proportion import proportions_ztest
# Variant results
ctrl_conv, ctrl_n = 1180, 10000 # 11.8% conversion
test_conv, test_n = 1340, 10000 # 13.4% conversion
p_ctrl = ctrl_conv / ctrl_n
p_test = test_conv / test_n
# Two-proportion z-test
count = [test_conv, ctrl_conv]
nobs = [test_n, ctrl_n]
z, p = proportions_ztest(count, nobs)
lift = (p_test - p_ctrl) / p_ctrl * 100
print(f"Relative lift: {lift:.1f}%")
print(f"p-value: {p:.4f}")
print(f"Significant: {p < 0.05}")
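For the practical-significance call, a confidence interval on the difference is more informative than the p-value alone. A sketch using statsmodels with the same counts as above:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

ctrl_conv, ctrl_n = 1180, 10000
test_conv, test_n = 1340, 10000

# 95% CI for the absolute difference in conversion rates (test - control)
low, upp = confint_proportions_2indep(test_conv, test_n,
                                      ctrl_conv, ctrl_n,
                                      compare='diff')
print(f"Absolute lift: {(test_conv / test_n - ctrl_conv / ctrl_n) * 100:.2f} pp")
print(f"95% CI: [{low * 100:.2f} pp, {upp * 100:.2f} pp]")
```

If the whole interval clears the minimum detectable effect you committed to before launch, the result is both statistically and practically significant; if the interval straddles that threshold, the honest answer is "promising but inconclusive".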
A/B testing done well is one of the most powerful tools in an analyst's toolkit. The discipline it imposes — pre-registered hypotheses, power calculations, clean randomisation — forces rigorous thinking that makes every product decision more defensible. The teams that consistently do it right build a compounding informational advantage over those that just chase p-values.