unspurious.calculators

A/B Test Calculator

Did your variant really beat the control, or did it just get lucky? Enter the visitors and conversions for each version to get the conversion rates, the absolute and relative uplift, the statistical significance (a two-proportion z-test) and a confidence interval for the difference — with blunt warnings about the two ways A/B tests fool people: stopping early, and confusing “significant” with “worth shipping”.

A “conversion” is whatever you count as success: a signup, a sale, a click. Decide your sample size and stopping point before you start; the question on stopping early, below, explains why.

In plain English

An A/B test compares two conversion rates measured on random visitors. Because each rate is only an estimate from a sample, some gap between them appears even when the two versions are identical. The test asks whether the gap you saw is bigger than that random wobble would comfortably produce — and the confidence interval shows how big the true difference plausibly is.

conversion rate
Conversions ÷ visitors for each version — your estimate of its true success rate, with sampling noise baked in.
absolute vs relative uplift
Absolute uplift is the gap in percentage points (13% − 10% = 3 pts); relative uplift expresses it as a share of the control (3 ÷ 10 = +30%). Marketing loves the bigger-sounding relative figure.
p-value
The chance of seeing a gap at least this large, in either direction, if the two versions were truly identical. A small value means the gap is hard to dismiss as luck. It is not the probability that B is better.
confidence interval
The range of true differences compatible with your data. If a 95% interval excludes 0, the result is significant at the 0.05 threshold; the interval's width shows how precisely you have pinned the effect down.
power & significance
Power is the chance that your test detects a real effect of a given size; statistical significance says the observed gap is unlikely to be chance alone. Neither says whether the effect is big enough to matter. Both depend on having enough data, fixed in advance.
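
The glossary arithmetic can be checked in a few lines. The sketch below reuses the 10% versus 13% example and assumes 1,000 visitors per version (the visitor counts are an illustration, not figures from this page). The interval uses the unpooled normal approximation, which is one common choice and not necessarily the exact formula this calculator applies; the function name `summarize` is hypothetical.

```python
import math
from statistics import NormalDist


def summarize(visitors_a, conv_a, visitors_b, conv_b, confidence=0.95):
    """Conversion rates, uplifts and a normal-approximation interval for B - A."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    absolute = rate_b - rate_a      # gap as a proportion (0.03 = 3 points)
    relative = absolute / rate_a    # gap as a share of the control rate

    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                   + rate_b * (1 - rate_b) / visitors_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%

    return {
        "rate_a": rate_a,
        "rate_b": rate_b,
        "absolute_uplift": absolute,
        "relative_uplift": relative,
        "ci_low": absolute - z_crit * se,
        "ci_high": absolute + z_crit * se,
    }


# Assumed example: 100 of 1,000 convert on A, 130 of 1,000 on B.
result = summarize(1000, 100, 1000, 130)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the interval sits just above zero and stretches to nearly six points, so the same data are compatible with a tiny effect and a large one.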

Frequently asked

How do you know if an A/B test is statistically significant?

Compare the two conversion rates with a two-proportion z-test. If the p-value is below your threshold (commonly 0.05) or, equivalently, the matching confidence interval for the difference excludes zero, the result is statistically significant: a gap this large would be unlikely if the versions were truly identical. That conclusion only holds if you fixed your sample size in advance and did not stop the moment the numbers looked good.
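
For readers who want to see the test itself, here is a minimal sketch of a two-sided two-proportion z-test using the pooled standard error. It assumes the textbook form of the test; the function name `two_proportion_z_test` and the example counts are illustrative, and the calculator on this page may differ in details such as continuity corrections.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(visitors_a, conv_a, visitors_b, conv_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b

    # Under the null hypothesis both versions share one rate, so pool the data.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    if se == 0:
        raise ValueError("no variation in the data: every visitor or none converted")

    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails
    return z, p_value


# Assumed example: 100 of 1,000 convert on A, 130 of 1,000 on B.
z, p = two_proportion_z_test(1000, 100, 1000, 130)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```

The normal approximation behind this test is reasonable when each version has at least a handful of conversions and non-conversions; with very small counts an exact test is the safer choice.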

Why shouldn't I stop the test as soon as it's significant?

Because of peeking. If you check the p-value repeatedly and stop the instant it dips below 0.05, you will cross that line by chance far more than 5% of the time even when the variants are identical — inflating false positives dramatically. Decide the sample size and end date before launching, and judge the result only then. Checking many times and stopping on a win is a form of p-hacking.
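
A small simulation makes the inflation visible. The sketch below runs A/A tests, where both versions share the same true rate, and compares two rules: declare a win if the p-value dips below 0.05 at any of ten interim looks, or judge only at the planned end. Every parameter (500 experiments, ten looks of 200 visitors per version, a 10% true rate, the seed) is an assumption chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def p_value(n_a, c_a, n_b, c_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): nothing to detect
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_experiments=500, looks=10, visitors_per_look=200,
             rate=0.10, alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, false-positive rate at the end)."""
    rng = random.Random(seed)
    wins_with_peeking = 0
    wins_at_end = 0
    for _experiment in range(n_experiments):
        n = conv_a = conv_b = 0
        crossed = False
        p = 1.0
        for _look in range(looks):
            for _visitor in range(visitors_per_look):
                conv_a += rng.random() < rate  # both arms have the same true rate
                conv_b += rng.random() < rate
            n += visitors_per_look
            p = p_value(n, conv_a, n, conv_b)
            if p < alpha:
                crossed = True  # a peeker would have stopped and shipped here
        wins_with_peeking += crossed
        wins_at_end += p < alpha
    return wins_with_peeking / n_experiments, wins_at_end / n_experiments


peeking_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peeking_rate:.1%}")
print(f"false positives at planned end: {fixed_rate:.1%}")
```

Since the versions are identical, every "win" here is a false positive. The end-only rule should land near the nominal 5%, while the peeking rule typically comes out several times higher.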

Is a statistically significant result always worth shipping?

No. With enough traffic, a trivial uplift — a hundredth of a percentage point — can be statistically significant yet worthless. Always read the confidence interval: it tells you not just that there is an effect, but how big it plausibly is. A significant result whose interval runs from "barely positive" to "modest" may not justify the cost of the change.
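
One way to make this check routine is to compare the confidence interval against the smallest uplift you would consider worth shipping. The sketch below does that; the function name `verdict`, the half-point threshold and the traffic figures are all assumptions for illustration, and the interval uses the unpooled normal approximation.

```python
import math
from statistics import NormalDist


def verdict(visitors_a, conv_a, visitors_b, conv_b,
            min_worthwhile, confidence=0.95):
    """Classify a result by where the interval for B - A sits.

    min_worthwhile is the smallest absolute uplift (as a proportion)
    that would justify the change; it is a business judgment, not a statistic.
    """
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    diff = rate_b - rate_a
    se = math.sqrt(rate_a * (1 - rate_a) / visitors_a
                   + rate_b * (1 - rate_b) / visitors_b)
    margin = NormalDist().inv_cdf(0.5 + confidence / 2) * se
    ci_low, ci_high = diff - margin, diff + margin

    if ci_low <= 0 <= ci_high:
        return "not significant"
    if ci_high < 0:
        return "significant, but B is worse"
    if ci_high < min_worthwhile:
        return "significant but too small to matter"
    if ci_low >= min_worthwhile:
        return "significant and plausibly worthwhile"
    return "significant, size uncertain"


# Assumed example: 100 million visitors per version, uplift of 0.01 points.
huge = verdict(100_000_000, 10_000_000, 100_000_000, 10_010_000,
               min_worthwhile=0.005)
# Assumed example: 1,000 visitors per version, uplift of 3 points.
small = verdict(1000, 100, 1000, 130, min_worthwhile=0.005)
print(huge)
print(small)
```

In the first example the whole interval lies far below half a point, so the effect is real but negligible. In the second the interval runs from about 0.2 to 5.8 points, so the data cannot yet say whether the change clears the bar.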

Can I test more than two variants at once?

Yes — an A/B/n test — but every extra variant is another comparison, and the chance that at least one looks significant by luck climbs with each one (the multiple-comparisons problem). If you pit several variants against the control, lower your threshold (for example a Bonferroni correction, dividing 0.05 by the number of comparisons) or treat the result as exploratory and confirm the apparent winner in a fresh two-way test.
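
The correction is simple to apply by hand or in code. The sketch below divides the threshold by the number of comparisons and keeps only the variants that clear it. The variant names and p-values are made up for illustration, and the family-wise error figure assumes independent comparisons, which is only approximately true when every variant shares one control.

```python
def bonferroni_winners(p_values, alpha=0.05):
    """Return (corrected threshold, variants whose p-value clears it).

    p_values maps a variant name to the p-value of its comparison with control.
    """
    if not p_values:
        raise ValueError("need at least one comparison")
    threshold = alpha / len(p_values)
    winners = sorted(name for name, p in p_values.items() if p < threshold)
    return threshold, winners


def family_wise_error_rate(comparisons, alpha=0.05):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


# Hypothetical p-values for three variants, each compared with the control.
threshold, winners = bonferroni_winners({"B": 0.03, "C": 0.012, "D": 0.20})
uncorrected_risk = family_wise_error_rate(3)
print(f"corrected threshold: {threshold:.4f}")
print(f"variants that pass: {winners}")
print(f"uncorrected chance of a fluke winner: {uncorrected_risk:.1%}")
```

Here variant B would have passed an uncorrected 0.05 threshold but fails the corrected one, which is exactly the kind of result worth confirming in a fresh two-way test.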