
A/B Testing Strategy

Data-driven decision making through controlled experimentation

⬢ TIER 2 · Industry
Salary impact: +$20k
Time to learn: 6 months
Difficulty: Medium
Careers: 12
AT A GLANCE

A/B testing strategy is the org-level discipline of running controlled experiments to make product, marketing, and UX decisions with data, not opinion. Mature programs (Booking.com, Netflix, Airbnb) run thousands of tests per year. Career path: Practitioner (runs individual tests, $80-110k) → Strategist (program design, prioritization, $110-150k) → Experimentation Lead (platform + culture, $150-200k) over 6-12 months. Built on statistics (significance testing, sample size, MDE), tooling (Optimizely, GrowthBook, Statsig, LaunchDarkly), and prioritization frameworks (ICE, PIE, RICE).

What is A/B Testing Strategy?

A/B testing strategy goes beyond running individual tests to building a systematic experimentation program. It covers hypothesis formation, test prioritization (e.g. the ICE framework), statistical rigor, test-velocity optimization, and building an experimentation culture across the organization. Companies with mature experimentation programs (Google, Netflix, Booking.com) run thousands of tests annually. A well-designed testing program accelerates learning velocity and replaces opinion-based decision-making with evidence.

🔧 TOOLS & ECOSYSTEM
Optimizely, GrowthBook, Statsig, LaunchDarkly, VWO, Google Optimize (deprecated), Eppo, Split.io, Amplitude Experiment, Mixpanel, PostHog, Unbounce

📋 Before you start

💰 Salary by region

| Region | Junior | Mid    | Senior |
|--------|--------|--------|--------|
| USA    | $95k   | $135k  | $185k  |
| UK     | £55k   | £80k   | £115k  |
| EU     | €60k   | €85k   | €120k  |
| Canada | C$100k | C$140k | C$190k |

❓ FAQ

Frequentist vs Bayesian A/B testing: which should I use?
Frequentist testing (p-values, fixed sample size) is what most tools and companies use; it works well for high-traffic web tests with clear, pre-registered hypotheses. Bayesian testing (probability of B beating A) is more intuitive, lets you peek without inflating false positives, and handles low-traffic scenarios better. Modern platforms (Statsig, Eppo, GrowthBook) default to Bayesian or sequential testing. For a new program, pick the framework your team understands, not the 'correct' one; both work when used correctly. The sketch below contrasts the two verdicts on the same data.
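
A minimal sketch of the difference, on made-up conversion counts: the frequentist side runs a two-proportion z-test; the Bayesian side samples Beta posteriors and reports the probability that B beats A. The numbers, priors, and seed are illustrative, not from any particular platform.

```python
import numpy as np
from scipy.stats import norm

conv_a, n_a = 480, 10_000   # control: 4.8% conversion (invented)
conv_b, n_b = 540, 10_000   # variant: 5.4% conversion (invented)

# Frequentist: two-proportion z-test with pooled variance, two-sided p-value
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (conv_b / n_b - conv_a / n_a) / se
p_value = 2 * norm.sf(abs(z))

# Bayesian: Beta(1, 1) priors, Monte Carlo estimate of P(B > A)
rng = np.random.default_rng(42)
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, 100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, 100_000)
prob_b_beats_a = (post_b > post_a).mean()

print(f"frequentist p-value: {p_value:.4f}")
print(f"Bayesian P(B > A):   {prob_b_beats_a:.3f}")
```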
How do I avoid the peeking problem?
Peeking means checking a frequentist test before reaching the pre-calculated sample size; it inflates the false-positive rate from 5% to 30%+. Solutions: (1) calculate sample size in advance and only check at the end; (2) use sequential testing (always-valid p-values, e.g. mSPRT), which is peek-safe; (3) use Bayesian inference, which is naturally peek-tolerant. Modern platforms handle this automatically; older setups (Google Optimize, simple t-tests) require discipline. The simulation below shows the inflation directly.
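
A rough A/A simulation of the effect (no true difference between arms; the traffic numbers and look schedule are invented): one rule looks exactly once at the planned sample size, the other peeks ten times and stops at the first 'significant' result.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_true, n_total, alpha, sims = 0.05, 20_000, 0.05, 2_000
checkpoints = np.linspace(n_total // 10, n_total, 10, dtype=int)  # 10 peeks

def z_test_p(ca, cb, n):
    """Two-sided p-value for a two-proportion z-test with n users per arm."""
    p_pool = (ca + cb) / (2 * n)
    se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 2 * norm.sf(abs((cb - ca) / n) / se) if se > 0 else 1.0

fp_one_look, fp_peeking = 0, 0
for _ in range(sims):
    a = rng.random(n_total) < p_true  # both arms share the same true rate
    b = rng.random(n_total) < p_true
    # Rule 1: look exactly once, at the planned sample size
    fp_one_look += z_test_p(a.sum(), b.sum(), n_total) < alpha
    # Rule 2: peek at every checkpoint, stop on the first "win"
    fp_peeking += any(z_test_p(a[:n].sum(), b[:n].sum(), n) < alpha
                      for n in checkpoints)

# Expect ~5% for one look; roughly 15-25% for ten peeks.
# Continuous peeking pushes the rate higher still.
print(f"one look:  {fp_one_look / sims:.1%} false positives")
print(f"ten peeks: {fp_peeking / sims:.1%} false positives")
```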
How do I prioritize which tests to run?
ICE (Impact × Confidence × Ease, 1-10 each) for fast triage. PIE (Potential × Importance × Ease) for marketing. RICE (Reach × Impact × Confidence ÷ Effort) for product. The framework matters less than: (1) writing down your reasoning before running; (2) reviewing predictions vs results quarterly; (3) killing low-traffic tests that can't reach significance. Top experimentation programs maintain a backlog of 50-100 ideas, run 20-50 tests per quarter, and ship the wins. See the scoring sketch after this answer.
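
A toy ICE-scoring sketch for triaging a backlog; the ideas and scores below are invented placeholders.

```python
from dataclasses import dataclass

@dataclass
class TestIdea:
    name: str
    impact: int      # 1-10: how much could this move the primary metric?
    confidence: int  # 1-10: how much evidence backs the hypothesis?
    ease: int        # 1-10: how cheap is it to build and run?

    @property
    def ice(self) -> int:
        return self.impact * self.confidence * self.ease

backlog = [
    TestIdea("Shorter signup form", impact=7, confidence=8, ease=9),
    TestIdea("New pricing page layout", impact=9, confidence=5, ease=4),
    TestIdea("Onboarding email drip", impact=6, confidence=6, ease=7),
]

# Highest ICE score first: run these next
for idea in sorted(backlog, key=lambda i: i.ice, reverse=True):
    print(f"{idea.ice:>4}  {idea.name}")
```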
Why do most A/B tests come back inconclusive?
Two reasons. First, effect sizes in mature products are tiny: 0.5-2% lifts are typical, requiring tens of thousands of users to detect. Second, tests are often underpowered: if the sample size was calculated for a 10% lift but the actual effect is 1%, you need roughly 100x more users. Fix: calculate the Minimum Detectable Effect (MDE) before launch, accept that 70%+ of tests will be flat, and treat 'inconclusive' as useful data ('don't waste eng cycles on this idea'). The calculator below makes the scaling concrete.
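
A back-of-envelope sample-size calculator, using the standard normal-approximation formula for comparing two proportions; the 5% baseline and the lifts are illustrative.

```python
from scipy.stats import norm

def sample_size_per_arm(baseline, rel_lift, alpha=0.05, power=0.8):
    """Users per arm to detect a relative lift at the given alpha and power."""
    p1, p2 = baseline, baseline * (1 + rel_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2) + 1

# A 10x smaller lift needs roughly 100x more users (the 1/MDE^2 scaling)
for lift in (0.10, 0.02, 0.01):
    n = sample_size_per_arm(baseline=0.05, rel_lift=lift)
    print(f"{lift:.0%} relative lift on a 5% baseline: ~{n:,} users per arm")
```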
Tool stack: Optimizely vs GrowthBook vs Statsig vs LaunchDarkly?
Optimizely: enterprise, expensive ($30k+/yr), strong WYSIWYG editor, great for marketing teams. GrowthBook: open-source, self-hostable, dev-friendly, free tier. Statsig: free tier plus strong Bayesian stats, becoming the modern default. LaunchDarkly: feature flags first, experimentation second; pick it if you already have it. For startups: the Statsig free tier or GrowthBook. For enterprise: Optimizely or self-hosted GrowthBook. For PostHog stacks: the built-in PostHog Experiments.
What's a 'guardrail metric' and why do I need them?
A guardrail is a secondary metric that monitors for unintended harm even when the primary metric wins. Example: for a paywall change, the primary metric is revenue (should go up) and the guardrail is 90-day retention (must not drop). Without guardrails, you ship 'wins' that destroy long-term LTV. Standard guardrails: retention, NPS, support tickets, page load time, error rate. Mature programs auto-flag tests where any guardrail moves more than 1%, even if the primary wins. A minimal decision-rule sketch follows this answer.
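
A minimal sketch of the ship/hold decision rule; the metric names, deltas, and the 1% threshold are invented for illustration.

```python
def ship_decision(primary_lift: float,
                  primary_significant: bool,
                  guardrails: dict[str, float],
                  max_drop: float = -0.01) -> str:
    """guardrails maps metric name -> relative change (-0.02 means a 2% drop)."""
    if not (primary_significant and primary_lift > 0):
        return "NO SHIP: primary metric did not win"
    breached = [m for m, delta in guardrails.items() if delta < max_drop]
    if breached:
        return "HOLD: guardrail breach on " + ", ".join(breached)
    return "SHIP"

print(ship_decision(
    primary_lift=0.04, primary_significant=True,
    guardrails={"retention_90d": -0.025, "page_load_time": 0.003},
))  # -> HOLD: guardrail breach on retention_90d
```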
How big is the salary jump from running tests to running a program?
Practitioner ($80-110k): runs individual tests, picks variants, calculates significance. Strategist ($110-150k): designs the prioritization framework, builds the test backlog, mentors. Lead ($150-200k+): owns the experimentation platform, sets test-velocity targets, runs the experimentation guild. The 2-3x salary lift comes from moving from the tactical (this test) to the systemic (the org runs 50 tests a quarter, and here's the playbook).
