▶ How do I calculate sample size before running a test?
Use the formula n = 2σ²(z_α + z_β)² / δ², where σ² is the pooled variance, z_α and z_β are critical values (1.96 for 95% confidence, 0.84 for 80% power), and δ is your Minimum Detectable Effect (MDE). Most tools auto-calculate this from baseline conversion, expected lift %, and significance level. Common mistake: assuming a 10% lift when real products typically see 1-2%; detecting a 1% lift requires roughly 100x more users than detecting a 10% lift. Tip: always calculate the MDE first and reject tests that would need >1M users to run.
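A minimal sketch of that formula in Python for a conversion-rate metric (the function name and defaults are illustrative; the Bernoulli variance p(1-p) stands in for the pooled variance):

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2, per variant."""
    z_alpha = norm.ppf(1 - alpha / 2)               # 1.96 for 95% confidence (two-sided)
    z_beta = norm.ppf(power)                        # 0.84 for 80% power
    sigma_sq = baseline_rate * (1 - baseline_rate)  # variance of a Bernoulli metric
    return int(2 * sigma_sq * (z_alpha + z_beta) ** 2 / mde_abs ** 2) + 1

# Shrinking the detectable lift 10x multiplies the required sample ~100x:
print(sample_size_per_variant(0.05, 0.005))   # 10% relative lift on a 5% baseline: ~30k per variant
print(sample_size_per_variant(0.05, 0.0005))  # 1% relative lift: ~3M per variant
```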
▶ Peeking at results: why is it dangerous, and how do I avoid it?
Peeking (checking a frequentist test before reaching the pre-calculated sample size) inflates the false positive rate from 5% to 30%+ because each look is effectively another hypothesis test. Solutions: (1) freeze the sample size in advance and only check once; (2) use sequential testing (mSPRT, always-valid p-values), which is peek-safe by design; (3) use Bayesian inference, which is naturally more tolerant of peeking. Modern platforms (Statsig, Eppo, GrowthBook) default to sequential or Bayesian methods; older tools require discipline. Budget roughly 3-5% of your time for enforcing fixed end dates and resisting the urge to peek.
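To see the inflation for yourself, here is a small A/A simulation (not from any specific platform): both arms are identical, yet stopping at the first interim look with p < 0.05 triggers far more often than 5% of the time.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_total, checkpoints, n_sims = 10_000, 20, 500
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=n_total)  # no true difference between the arms
    b = rng.normal(size=n_total)
    for k in range(1, checkpoints + 1):
        n = k * n_total // checkpoints          # "peek" after each batch of users
        if ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1                # declared a winner that doesn't exist
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.0%}")  # well above 5%
```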
▶ Bayesian vs Frequentist for my test: which do I choose?
Frequentist (p-values, fixed sample size): the easiest mental model, the historical default at most companies, requires pre-registering the sample size. Bayesian (posterior probability that B > A): more intuitive ("95% chance B is better"), peek-safe, and excellent for low-traffic scenarios. Pick based on (1) your team's comfort with the math and (2) traffic volume. Low-traffic tests (<1k users/day): Bayesian wins. High-traffic tests (>10k/day): either works. Consistency beats perfection: pick one framework, stick to it, and avoid context-switching.
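As a sketch of the Bayesian readout, here is a Beta-Binomial model (illustrative, not any particular platform's implementation) that turns raw conversion counts into "probability B beats A":

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    # Uniform Beta(1, 1) prior; posterior is Beta(1 + conversions, 1 + non-conversions).
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return (post_b > post_a).mean()

# Low-traffic example: 800 users per arm, 5.0% vs 6.5% observed conversion.
print(prob_b_beats_a(conv_a=40, n_a=800, conv_b=52, n_b=800))  # ~0.9 -> "~90% chance B is better"
```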
▶ What guardrails should I monitor in every test?
Minimum guardrails: retention (don't kill long-term LTV), key engagement metrics (session duration, DAU), and technical health (error rate, load time, null cohorts). Secondary guardrails: NPS, support tickets, downstream revenue events. Set guardrail thresholds at experiment start (e.g., a retention drop >2% auto-stops the test). Flag results where a guardrail moves 1%+ in the wrong direction even if the primary metric wins; this catches hidden regressions. Example: a paywall test lifts revenue 15% but 90-day retention drops 5% → ship? No, dig deeper.
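One way to make the thresholds executable is a small check run at every readout; the metric names and limits below are placeholder assumptions you would set at experiment start:

```python
GUARDRAILS = {
    # metric: threshold on relative change vs control (negative = allowed drop, positive = allowed rise)
    "d90_retention": -0.02,     # auto-stop if 90-day retention drops more than 2%
    "session_duration": -0.02,
    "error_rate": 0.01,         # auto-stop if error rate rises more than 1%
}

def shippable(primary_lift, guardrail_deltas):
    """A winning primary metric is not shippable if any guardrail breaches its threshold."""
    breaches = {}
    for metric, delta in guardrail_deltas.items():
        limit = GUARDRAILS[metric]
        if (limit < 0 and delta < limit) or (limit > 0 and delta > limit):
            breaches[metric] = delta
    return (primary_lift > 0 and not breaches), breaches

# The paywall example above: +15% revenue but -5% 90-day retention.
print(shippable(0.15, {"d90_retention": -0.05, "session_duration": 0.0, "error_rate": 0.0}))
# -> (False, {'d90_retention': -0.05}): dig deeper before shipping
```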
▶ When should I use multi-armed bandits (MAB) instead of A/B testing?
A/B test: fixed allocation (50/50), predefined sample size, one-time decision at the end. MAB: dynamic allocation that shifts traffic toward the best variant as it emerges, wasting less traffic on losers while the test converges. Use a MAB when: (1) you can't afford to keep sending 50% of traffic to a variant that is clearly losing (e.g., an expensive ML model), (2) you have many arms (8+) and want to deprioritize losers fast, or (3) iteration speed matters more than statistical certainty. Downsides: results are harder to interpret and you can't compare variants directly. Example: tuning a recommendation algorithm across 50 variants, where a MAB can beat a fixed A/B split by 20-30% in efficiency.
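A minimal Thompson-sampling sketch (the conversion rates are made up for the demo) shows the dynamic-allocation behavior: traffic drifts toward the strongest arm instead of staying at a fixed split.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.040, 0.045, 0.052, 0.038]   # hidden per-arm conversion rates (demo values)
successes = np.ones(len(true_rates))        # Beta(1, 1) priors for each arm
failures = np.ones(len(true_rates))

for _ in range(20_000):                     # each iteration = one incoming user
    sampled = rng.beta(successes, failures) # draw a plausible rate per arm from its posterior
    arm = int(np.argmax(sampled))           # route the user to the best draw
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted

traffic_share = (successes + failures - 2) / 20_000
print(np.round(traffic_share, 2))           # most traffic ends up on the best arm; losers fade fast
```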
▶ How do I know if my result is statistically significant vs practically significant?
Statistical significance: p < 0.05, meaning data this extreme would show up less than 5% of the time if there were truly no effect. Practical significance: the effect size is big enough for the business to care about. Example: 100k users and a +0.1% conversion lift can be statistically significant (p = 0.03) but cost $500k to implement for $2k/month in revenue → not worth it. Always report both: (1) the p-value or confidence interval and (2) the absolute lift in dollars or retention points. Set a Minimum Detectable Effect threshold at test start (e.g., only ship at +1% or better) to avoid chasing tiny wins.
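A worked version of that example; the 1% baseline conversion rate and the per-arm reading of "100k users" are assumptions added here, while the lift, cost, and revenue figures come from the text:

```python
import math
from scipy.stats import norm

n, base, lift = 100_000, 0.010, 0.001        # 100k users per arm (assumed), 1% baseline (assumed), +0.1pp lift
se = math.sqrt(2 * base * (1 - base) / n)    # standard error of the difference in proportions
z = lift / se
p_value = 2 * (1 - norm.cdf(z))
print(f"p = {p_value:.3f}")                  # ~0.02-0.03: statistically significant

monthly_revenue_gain = 2_000                 # $2k/month from the text
implementation_cost = 500_000
payback_years = implementation_cost / (monthly_revenue_gain * 12)
print(f"payback: {payback_years:.0f} years") # ~21 years: not practically significant
```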
▶ How big is the salary jump from running individual tests to test design/review?
Test runner ($70-95k): executes tests, calculates sample sizes, reports results. Senior experimenter ($100-130k): designs test roadmaps, reviews other experimenters' work, handles statistical edge cases, builds experiment templates. Lead ($130-160k+): owns velocity targets, platform strategy, and causal inference. The 40-60% jump comes from moving from tactical work (this one test) to systemic work (every test in the org meets a standard). Fastest way up: mentor 2-3 junior testers and build a reusable statistical framework for your company.