▶ How do I calculate sample size before running a test?
Use the formula n = 2σ²(z_α + z_β)² / δ², where σ² is the pooled variance, z_α and z_β are critical values (1.96 for 95% confidence, 0.84 for 80% power), and δ is your Minimum Detectable Effect (MDE). Most tools auto-calculate this from baseline conversion, expected lift %, and significance level. Common mistake: assuming a 10% lift when real products typically see 1-2%; detecting a 1% lift requires roughly 100x more users than detecting a 10% lift. Tip: always calculate the MDE first and reject tests that would need >1M users to run.
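A minimal sketch of that formula in Python for a conversion-rate metric (the function name and defaults are illustrative; the Bernoulli variance p(1-p) stands in for the pooled variance):

```python
from scipy.stats import norm

def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2, per variant."""
    z_alpha = norm.ppf(1 - alpha / 2)               # 1.96 for 95% confidence (two-sided)
    z_beta = norm.ppf(power)                        # 0.84 for 80% power
    sigma_sq = baseline_rate * (1 - baseline_rate)  # variance of a Bernoulli metric
    return int(2 * sigma_sq * (z_alpha + z_beta) ** 2 / mde_abs ** 2) + 1

# Shrinking the detectable lift 10x multiplies the required sample ~100x:
print(sample_size_per_variant(0.05, 0.005))   # 10% relative lift on a 5% baseline: ~30k per variant
print(sample_size_per_variant(0.05, 0.0005))  # 1% relative lift: ~3M per variant
```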
▶ Peeking at results: why is it dangerous, and how do I avoid it?
Peeking (checking a frequentist test before reaching the pre-calculated sample size) inflates the false positive rate from 5% to 30%+ because each look is effectively another hypothesis test. Solutions: (1) freeze the sample size in advance and only check once; (2) use sequential testing (mSPRT, always-valid p-values), which is peek-safe by design; (3) use Bayesian inference, which is naturally more tolerant of peeking. Modern platforms (Statsig, Eppo, GrowthBook) default to sequential or Bayesian methods; older tools require discipline. Budget roughly 3-5% of your time for enforcing fixed end dates and resisting the urge to peek.
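To see the inflation for yourself, here is a small A/A simulation (not from any specific platform): both arms are identical, yet stopping at the first interim look with p < 0.05 triggers far more often than 5% of the time.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_total, checkpoints, n_sims = 10_000, 20, 500
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=n_total)  # no true difference between the arms
    b = rng.normal(size=n_total)
    for k in range(1, checkpoints + 1):
        n = k * n_total // checkpoints          # "peek" after each batch of users
        if ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1                # declared a winner that doesn't exist
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.0%}")  # well above 5%
```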
▶ Bayesian vs Frequentist for my test: which do I choose?
Frequentist (p-values, fixed sample size): the easiest mental model, the historical default at most companies, requires pre-registering the sample size. Bayesian (posterior probability that B > A): more intuitive ("95% chance B is better"), peek-safe, and excellent for low-traffic scenarios. Pick based on (1) your team's comfort with the math and (2) traffic volume. Low-traffic tests (<1k users/day): Bayesian wins. High-traffic tests (>10k/day): either works. Consistency beats perfection: pick one framework, stick to it, and avoid context-switching.
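As a sketch of the Bayesian readout, here is a Beta-Binomial model (illustrative, not any particular platform's implementation) that turns raw conversion counts into "probability B beats A":

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, samples=200_000):
    # Uniform Beta(1, 1) prior; posterior is Beta(1 + conversions, 1 + non-conversions).
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, samples)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, samples)
    return (post_b > post_a).mean()

# Low-traffic example: 800 users per arm, 5.0% vs 6.5% observed conversion.
print(prob_b_beats_a(conv_a=40, n_a=800, conv_b=52, n_b=800))  # ~0.9 -> "~90% chance B is better"
```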
▶ What guardrails should I monitor in every test?
Minimum guardrails: retention (don't kill long-term LTV), key engagement metrics (session duration, DAU), and technical health (error rate, load time, null cohorts). Secondary guardrails: NPS, support tickets, downstream revenue events. Set guardrail thresholds at experiment start (e.g., a retention drop >2% auto-stops the test). Flag results where a guardrail moves 1%+ in the wrong direction even if the primary metric wins; this catches hidden regressions. Example: a paywall test lifts revenue 15% but 90-day retention drops 5% → ship? No, dig deeper.
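One way to make the thresholds executable is a small check run at every readout; the metric names and limits below are placeholder assumptions you would set at experiment start:

```python
GUARDRAILS = {
    # metric: threshold on relative change vs control (negative = allowed drop, positive = allowed rise)
    "d90_retention": -0.02,     # auto-stop if 90-day retention drops more than 2%
    "session_duration": -0.02,
    "error_rate": 0.01,         # auto-stop if error rate rises more than 1%
}

def shippable(primary_lift, guardrail_deltas):
    """A winning primary metric is not shippable if any guardrail breaches its threshold."""
    breaches = {}
    for metric, delta in guardrail_deltas.items():
        limit = GUARDRAILS[metric]
        if (limit < 0 and delta < limit) or (limit > 0 and delta > limit):
            breaches[metric] = delta
    return (primary_lift > 0 and not breaches), breaches

# The paywall example above: +15% revenue but -5% 90-day retention.
print(shippable(0.15, {"d90_retention": -0.05, "session_duration": 0.0, "error_rate": 0.0}))
# -> (False, {'d90_retention': -0.05}): dig deeper before shipping
```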
▶ When should I use multi-armed bandits (MAB) instead of A/B testing?
A/B test: fixed allocation (50/50), predefined sample size, one-time decision at the end. MAB: dynamic allocation that shifts traffic toward the best variant as it emerges, wasting less traffic on losers while the test converges. Use a MAB when: (1) you can't afford to keep sending 50% of traffic to a variant that is clearly losing (e.g., an expensive ML model), (2) you have many arms (8+) and want to deprioritize losers fast, or (3) iteration speed matters more than statistical certainty. Downsides: results are harder to interpret and you can't compare variants directly. Example: tuning a recommendation algorithm across 50 variants, where a MAB can beat a fixed A/B split by 20-30% in efficiency.
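A minimal Thompson-sampling sketch (the conversion rates are made up for the demo) shows the dynamic-allocation behavior: traffic drifts toward the strongest arm instead of staying at a fixed split.

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.040, 0.045, 0.052, 0.038]   # hidden per-arm conversion rates (demo values)
successes = np.ones(len(true_rates))        # Beta(1, 1) priors for each arm
failures = np.ones(len(true_rates))

for _ in range(20_000):                     # each iteration = one incoming user
    sampled = rng.beta(successes, failures) # draw a plausible rate per arm from its posterior
    arm = int(np.argmax(sampled))           # route the user to the best draw
    converted = rng.random() < true_rates[arm]
    successes[arm] += converted
    failures[arm] += 1 - converted

traffic_share = (successes + failures - 2) / 20_000
print(np.round(traffic_share, 2))           # most traffic ends up on the best arm; losers fade fast
```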
▶ How do I know if my result is statistically significant vs practically significant?
Statistical significance: p < 0.05, meaning data this extreme would show up less than 5% of the time if there were truly no effect. Practical significance: the effect size is big enough for the business to care about. Example: 100k users and a +0.1% conversion lift can be statistically significant (p = 0.03) but cost $500k to implement for $2k/month in revenue → not worth it. Always report both: (1) the p-value or confidence interval and (2) the absolute lift in dollars or retention points. Set a Minimum Detectable Effect threshold at test start (e.g., only ship at +1% or better) to avoid chasing tiny wins.
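A worked version of that example; the 1% baseline conversion rate and the per-arm reading of "100k users" are assumptions added here, while the lift, cost, and revenue figures come from the text:

```python
import math
from scipy.stats import norm

n, base, lift = 100_000, 0.010, 0.001        # 100k users per arm (assumed), 1% baseline (assumed), +0.1pp lift
se = math.sqrt(2 * base * (1 - base) / n)    # standard error of the difference in proportions
z = lift / se
p_value = 2 * (1 - norm.cdf(z))
print(f"p = {p_value:.3f}")                  # ~0.02-0.03: statistically significant

monthly_revenue_gain = 2_000                 # $2k/month from the text
implementation_cost = 500_000
payback_years = implementation_cost / (monthly_revenue_gain * 12)
print(f"payback: {payback_years:.0f} years") # ~21 years: not practically significant
```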
▶ How big is the salary jump from running individual tests to test design/review?
Test runner ($70-95k): executes tests, calculates sample sizes, reports results. Senior experimenter ($100-130k): designs test roadmaps, reviews other experimenters' work, handles statistical edge cases, builds experiment templates. Lead ($130-160k+): owns velocity targets, platform strategy, and causal inference. The 40-60% jump comes from moving from tactical work (this one test) to systemic work (every test in the org meets a standard). Fastest way up: mentor 2-3 junior testers and build a reusable statistical framework for your company.