JobCannon

A/B Testing & Experimentation

Test hypotheses, measure impact, optimize conversions

⬢ TIER 3 · Industry

Salary impact: +$15k-$40k
Time to learn: 4 months
Difficulty: Medium
Careers: 12
TL;DR

A/B Testing & Experimentation is the hands-on practitioner skill: design testable hypotheses, execute variants (A/B/multivariate), calculate sample size, monitor for false positives, interpret results correctly, and know when to stop. Core difference from Strategy: strategists design the program roadmap; experimenters execute individual tests with statistical validity. Focus: frequentist p-values vs Bayesian credible intervals, peeking penalties, multi-armed bandits (MAB), sequential testing, guardrail metrics, Minimum Detectable Effect (MDE). Career path: test runner (runs tests, implements variants; first 18-24 months) → senior experimenter (MDE calculation, experiment design review, mentoring; $100-130k) within 2-3 years. Built on statistics (sample size, power analysis, confidence intervals) and tooling (Statsig, GrowthBook, Eppo, VWO, Optimizely).

What is A/B Testing & Experimentation

A/B testing = the scientific method applied to product and marketing decisions. Core skill for growth, product, and marketing roles. Salary boost: +$15k-$40k

🔧 TOOLS & ECOSYSTEM
Statsig · GrowthBook · Eppo · VWO · Optimizely · LaunchDarkly · Split.io · Amplitude Experiment · Python (scipy.stats) · Google Sheets (statistical templates)

📋 Before you start

💰 Salary by region

Region    Junior    Mid       Senior
USA       $80k      $110k     $150k
UK        £50k      £70k      £95k
EU        €55k      €75k      €105k
Canada    C$85k     C$115k    C$160k

❓ FAQ

How do I calculate sample size before running a test?
Use the formula n = 2σ²(z_α + z_β)² / δ², where σ is the pooled standard deviation, z_α and z_β are the normal critical values (1.96 for 95% confidence, 0.84 for 80% power), and δ is your Minimum Detectable Effect. Most tools auto-calculate this from baseline conversion, expected lift %, and significance level. Common mistake: assuming a 10% lift when real products typically move 1-2%; because n scales with 1/δ², detecting a 1% lift takes ~100x more users than detecting a 10% lift. Tip: always calculate the MDE first, and reject tests that would need >1M users to run.
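The formula above can be sketched in a few lines of Python using scipy.stats (which this page already lists as tooling); the function name and baseline figures are illustrative, not a standard API:

```python
# Sketch: sample size per group for a two-sided test on a difference in means,
# n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2.
import math
from scipy.stats import norm

def sample_size_per_group(sigma, mde, alpha=0.05, power=0.80):
    """sigma: pooled standard deviation; mde: minimum detectable effect (delta)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 5% test
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta)**2 / mde**2)

# For a conversion rate, sigma^2 ~ p * (1 - p) at the baseline rate.
baseline = 0.05                          # assumed 5% baseline conversion
sigma = (baseline * (1 - baseline)) ** 0.5
print(sample_size_per_group(sigma, mde=0.01))    # ~1 pp absolute MDE
print(sample_size_per_group(sigma, mde=0.005))   # halving the MDE ~4x the users
```

Note how halving the MDE quadruples the required sample, which is why optimistic lift assumptions blow up so badly.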
Peeking at results: why is it dangerous, and how do I avoid it?
Peeking (checking a frequentist test before reaching the pre-calculated sample size) inflates the false-positive rate from 5% to 30%+ because each peek is effectively another hypothesis test. Solutions: (1) freeze the sample size in advance and check only once; (2) use sequential testing (mSPRT, always-valid p-values), which is peek-safe by design; (3) use Bayesian inference, which is naturally tolerant of peeking. Modern platforms (Statsig, Eppo, GrowthBook) default to sequential or Bayesian methods; older tools require discipline. Budget 3-5% of your time for enforcing fixed end dates and resisting the urge to peek.
Bayesian vs frequentist for my test: which do I choose?
Frequentist (p-values, fixed sample size): the easiest mental model and what most companies have historically used, but it requires pre-registering the sample size. Bayesian (posterior probability that B > A): more intuitive ("95% chance B is better"), peek-safe, and excellent for low-traffic scenarios. Pick based on (1) your team's math comfort and (2) traffic volume. Low-traffic tests (<1k users/day): Bayesian wins. High-traffic tests (>10k/day): either works. Consistency beats perfection; pick one framework, stick to it, and avoid context-switching.
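A minimal Bayesian comparison for conversion rates needs only Beta posteriors; the counts below are made-up data and the Beta(1, 1) flat prior is an assumption:

```python
# Sketch: Bayesian A/B comparison via Beta-Binomial posteriors.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed data: conversions / visitors per variant.
conv_a, n_a = 120, 2400    # 5.00% control
conv_b, n_b = 150, 2400    # 6.25% variant

# Posterior with a Beta(1, 1) prior: Beta(1 + conversions, 1 + non-conversions).
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

p_b_better = (post_b > post_a).mean()
ci_low, ci_high = np.percentile(post_b - post_a, [2.5, 97.5])
print(f"P(B > A) = {p_b_better:.1%}")
print(f"95% credible interval for absolute lift: [{ci_low:.4f}, {ci_high:.4f}]")
```

The output reads exactly the way stakeholders want to hear it ("X% chance B is better"), which is the intuitiveness advantage mentioned above.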
What guardrails should I monitor in every test?
Minimum guardrails: retention (don't kill long-term LTV), key engagement metrics (session duration, DAU), and technical health (error rate, load time, null cohorts). Secondary guardrails: NPS, support tickets, downstream revenue events. Set guardrail thresholds at experiment start (e.g., a retention drop >2% auto-stops the test). Flag any result where a guardrail moves more than 1%, even if the primary metric wins; this catches hidden regressions. Example: a paywall test lifts revenue 15% but 90-day retention drops 5%. Ship it? No: dig deeper.
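One way to make "declare thresholds at experiment start" concrete is a small decision helper; the metric names, thresholds, and function shape below are illustrative assumptions, not any platform's API:

```python
# Sketch: guardrail thresholds declared up front, checked alongside the primary metric.
GUARDRAILS = {
    "retention_d90": -0.02,      # auto-stop if 90-day retention drops >2%
    "error_rate": -0.005,        # auto-stop if error rate worsens >0.5%
    "session_duration": -0.05,   # auto-stop if session duration drops >5%
}

def evaluate(primary_lift, guardrail_deltas, min_effect=0.01):
    """Ship only if the primary clears the MDE AND no guardrail breaches."""
    breaches = [m for m, delta in guardrail_deltas.items()
                if delta < GUARDRAILS[m]]
    if breaches:
        return f"STOP: guardrail breach on {', '.join(breaches)}"
    if primary_lift < min_effect:
        return "NO-SHIP: lift below minimum detectable effect"
    return "SHIP"

# The paywall example above: +15% revenue, but 90-day retention down 5 points.
print(evaluate(0.15, {"retention_d90": -0.05, "error_rate": 0.0,
                      "session_duration": 0.01}))
```

Because the thresholds live in one declared structure, a winning primary metric cannot quietly override a guardrail breach.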
When should I use multi-armed bandits (MAB) instead of A/B testing?
A/B test: fixed allocation (50/50), predefined sample size, a one-time decision at the end. MAB: dynamic allocation toward the best variant as it emerges, with lower opportunity cost because traffic converges on the winner. Use a MAB when: (1) you can't afford to run 50% of traffic on a proven-bad variant (e.g., an expensive ML model), (2) you have many arms (8+) and want to deprioritize losers fast, or (3) iteration speed matters more than statistical certainty. Downside: results are harder to interpret and variants can't be compared directly. Example: tuning a recommendation algorithm across 50 variants, where a MAB beats a fixed A/B test by 20-30% in efficiency.
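Thompson sampling is one common way to implement the dynamic allocation described above; the "true" conversion rates here are simulated, so this is a sketch of the mechanism, not a production bandit:

```python
# Sketch: Thompson sampling over Bernoulli arms (one flavor of multi-armed bandit).
import numpy as np

rng = np.random.default_rng(7)
true_rates = [0.04, 0.05, 0.08, 0.03]   # hidden; the bandit must discover arm 2
wins = np.ones(len(true_rates))          # Beta(1, 1) priors per arm
losses = np.ones(len(true_rates))
pulls = np.zeros(len(true_rates), dtype=int)

for _ in range(20_000):
    # Sample a plausible rate for each arm from its posterior; play the best sample.
    arm = int(np.argmax(rng.beta(wins, losses)))
    pulls[arm] += 1
    if rng.random() < true_rates[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

print("pulls per arm:", pulls)           # traffic concentrates on the best arm
print("best arm found:", int(np.argmax(pulls)))
```

Note the trade-off the answer above warns about: the losing arms end up with so few pulls that you cannot report a tight confidence interval comparing them.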
How do I know if my result is statistically significant vs practically significant?
Statistical significance = p < 0.05, meaning there is a <5% chance the result is due to random noise. Practical significance = the effect size is large enough to matter to the business. Example: 100k users and a +0.1% conversion lift can be statistically significant (p = 0.03) yet cost $500k to implement for $2k/month in revenue; not worth it. Always report both: (1) the p-value or confidence interval and (2) the absolute lift in dollars or retention points. Set a Minimum Detectable Effect threshold at test start (e.g., only ship at +1% or better) to avoid chasing tiny wins.
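The example above can be worked end to end with a two-proportion z-test plus a payback check. The 1% baseline, revenue, and cost figures are assumptions added to make the numbers concrete (the original example does not state a baseline):

```python
# Sketch: statistical significance (two-proportion z-test) vs practical significance (ROI).
from math import sqrt
from scipy.stats import norm

def two_prop_ztest(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, two-sided p-value) for conversion counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * (1 - norm.cdf(abs(z)))

# 100k users per arm, +0.1 pp lift on an assumed 1% baseline.
lift, p = two_prop_ztest(1_000, 100_000, 1_100, 100_000)
print(f"lift = {lift:.3%}, p = {p:.3f}")          # statistically significant

# Statistically significant is not the same as worth shipping:
monthly_revenue_gain = 2_000       # assumed, from the example above
implementation_cost = 500_000      # assumed, from the example above
payback_months = implementation_cost / monthly_revenue_gain
print(f"payback: {payback_months:.0f} months")    # decades-long payback: don't ship
```

Reporting both numbers side by side, as the answer recommends, makes the "significant but worthless" pattern obvious at a glance.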
How big is the salary jump from running individual tests to test design/review?
Test runner ($70-95k): executes tests, calculates sample size, reports results. Senior experimenter ($100-130k): designs test roadmaps, reviews other experimenters' work, handles statistical edge cases, builds experiment templates. Lead ($130-160k+): owns velocity targets, platform strategy, causal inference. The 40-60% jump comes from moving from tactical work (this one test) to systemic impact (every test in the org meets the standard). Fastest way up: mentor 2-3 junior testers and build a reusable statistical framework for your company.
