Frequentist vs Bayesian A/B testing: which should I use?
Frequentist testing (p-values, fixed sample size) is what most tools and companies use; it works well for high-traffic web tests with clear, pre-registered hypotheses. Bayesian testing (probability of B beating A) is more intuitive, is far more tolerant of peeking, and handles low-traffic scenarios better. Modern platforms (Statsig, Eppo, GrowthBook) default to Bayesian or sequential testing. For a new program: pick the framework your team understands, not the 'correct' one. Both work if used right.
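A minimal sketch of the Bayesian readout, assuming a Beta-Binomial model with a flat Beta(1,1) prior and made-up conversion counts (real platforms layer more machinery on top of this):

```python
# Estimate P(B beats A) by sampling from Beta posteriors over each variant's
# conversion rate. Counts below are invented illustration values.
import numpy as np

rng = np.random.default_rng(42)

# Observed data (hypothetical): conversions / visitors per variant
a_conv, a_n = 480, 10_000
b_conv, b_n = 525, 10_000

# Beta(1, 1) prior -> Beta(conversions + 1, non-conversions + 1) posterior
samples = 100_000
post_a = rng.beta(a_conv + 1, a_n - a_conv + 1, samples)
post_b = rng.beta(b_conv + 1, b_n - b_conv + 1, samples)

prob_b_beats_a = (post_b > post_a).mean()
expected_lift = ((post_b - post_a) / post_a).mean()

print(f"P(B > A) = {prob_b_beats_a:.1%}, expected relative lift = {expected_lift:.2%}")
```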
How do I avoid the peeking problem?
Peeking = checking a frequentist test before reaching the pre-calculated sample size. It inflates the false positive rate from 5% to 30%+. Solutions: (1) calculate sample size in advance and only check at the end; (2) use sequential testing (always-valid p-values, e.g. mSPRT), which is peek-safe; (3) use Bayesian inference, which is naturally peek-tolerant. Modern platforms handle this automatically; older setups (Google Optimize, hand-rolled t-tests) require discipline.
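To see the inflation concretely, here's a rough Monte Carlo sketch of an A/A test (no real difference) that gets checked at ten interim looks; the conversion rate, traffic, and peek schedule are invented for illustration:

```python
# Simulate repeated peeking at an A/A test: the per-peek threshold is 0.05,
# but the any-peek false positive rate climbs well above it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
runs, peeks, batch, p = 2_000, 10, 1_000, 0.05  # hypothetical traffic/rate

false_positives = 0
for _ in range(runs):
    a = rng.binomial(1, p, peeks * batch)
    b = rng.binomial(1, p, peeks * batch)  # no real difference between arms
    for k in range(1, peeks + 1):
        n = k * batch
        # Two-proportion z-test at this interim look
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se == 0:
            continue
        z = (b[:n].mean() - a[:n].mean()) / se
        if 2 * stats.norm.sf(abs(z)) < 0.05:
            false_positives += 1
            break  # the peek where someone would have "shipped the winner"

print(f"False positive rate with peeking: {false_positives / runs:.1%}")
```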
How do I prioritize which tests to run?
ICE (Impact × Confidence × Ease, 1-10 each) for fast triage. PIE (Potential × Importance × Ease) for marketing. RICE (Reach × Impact × Confidence ÷ Effort) for product. The framework matters less than: (1) writing down your reasoning before running; (2) reviewing predictions vs results quarterly; (3) killing low-traffic tests that can't reach significance. Top experimentation programs maintain a backlog of 50-100 ideas, run 20-50/quarter, ship the wins.
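If you prefer the scoring in code rather than a spreadsheet, here's a tiny RICE sketch; the idea names and numbers are invented:

```python
# Rank a hypothetical test backlog by RICE = Reach * Impact * Confidence / Effort.
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    reach: int         # users affected per quarter
    impact: float      # 0.25 = minimal ... 3 = massive
    confidence: float  # 0.0 - 1.0
    effort: float      # person-weeks

    @property
    def rice(self) -> float:
        return self.reach * self.impact * self.confidence / self.effort

backlog = [
    TestIdea("One-click checkout", reach=40_000, impact=2.0, confidence=0.8, effort=6),
    TestIdea("New onboarding copy", reach=15_000, impact=1.0, confidence=0.5, effort=1),
    TestIdea("Pricing page redesign", reach=8_000, impact=3.0, confidence=0.3, effort=8),
]

for idea in sorted(backlog, key=lambda i: i.rice, reverse=True):
    print(f"{idea.name:25s} RICE = {idea.rice:,.0f}")
```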
Why do most A/B tests come back inconclusive?
Two reasons: (1) effect sizes in mature products are tiny: 0.5-2% lifts are typical, requiring tens of thousands of users to detect. (2) tests are often underpowered: required sample size scales with the inverse square of the effect, so if you sized for a 10% lift and the actual effect is 1%, you need 100x more users. Fix: calculate the Minimum Detectable Effect (MDE) before launch, accept that 70%+ of tests will be flat, and treat 'inconclusive' as useful data ('don't waste eng cycles on this idea').
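A minimal pre-launch sizing sketch using statsmodels; the baseline rate and target lift are made-up, and the two-proportion power calculation is a standard choice rather than any particular platform's method:

```python
# How many users per variant do we need to detect a given lift at 80% power?
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05        # current conversion rate (hypothetical)
relative_lift = 0.02   # the 2% relative lift we hope to detect
target = baseline * (1 + relative_lift)

effect = proportion_effectsize(target, baseline)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

print(f"Users needed per variant: {n_per_arm:,.0f}")
```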
Tool stack: Optimizely vs GrowthBook vs Statsig vs LaunchDarkly?
Optimizely: enterprise, expensive ($30k+/yr), strong WYSIWYG editor, great for marketing teams. GrowthBook: open-source, self-hostable, dev-friendly, free tier. Statsig: free tier plus strong Bayesian stats, becoming the modern default. LaunchDarkly: feature flags first, experimentation second; pick it if you already have it. For startups: Statsig's free tier or GrowthBook. For enterprise: Optimizely or self-hosted GrowthBook. For PostHog stacks: PostHog Experiments is built in.
What's a 'guardrail metric' and why do I need one?
Guardrail = a secondary metric that monitors for unintended harm even when the primary metric wins. Example: testing a paywall change, where primary = revenue (should go up) and guardrail = 90-day retention (must not drop). Without guardrails, you ship 'wins' that destroy long-term LTV. Standard guardrails: retention, NPS, support tickets, page load time, error rate. Mature programs auto-flag tests where any guardrail moves more than 1% even if the primary wins.
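One possible shape for that auto-flag rule, as a sketch; the metric names, the 'harmful direction' convention, and the deltas are all invented for illustration:

```python
# Flag a test if any guardrail moves more than 1% in its harmful direction,
# even when the primary metric wins.
THRESHOLD = 0.01  # 1% movement in the harmful direction

# For each guardrail: which direction counts as harm ("down" or "up"),
# and the observed relative change vs control.
guardrails = {
    "retention_90d":   ("down", -0.023),  # retention dropped 2.3% -> harm
    "nps":             ("down", +0.004),
    "support_tickets": ("up",   +0.002),
    "page_load_time":  ("up",   +0.000),
}

flags = [
    name for name, (bad_dir, delta) in guardrails.items()
    if (bad_dir == "down" and delta < -THRESHOLD)
    or (bad_dir == "up" and delta > THRESHOLD)
]

print("Guardrail violations:", flags or "none")
```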
How big is the salary jump from running tests to running a program?
Practitioner ($80-110k): runs individual tests, picks variants, calculates significance. Strategist ($110-150k): designs the prioritization framework, builds the test backlog, mentors. Lead ($150-200k+): owns the experimentation platform, sets test velocity targets, runs the experimentation guild. The 2-3x salary lift comes from moving from tactical (this test) to systemic (the org runs 50 tests/quarter, here's the playbook).