ab-testing

by @DipakMajhi in Development

# Install this skill:

npx skills add DipakMajhi/product-management-skills --skill "ab-testing"

Install specific skill from multi-skill repository

# Description

Design, size, and analyze A/B tests for product features. Use this skill when someone needs to design an experiment, calculate sample size, check if a test has reached statistical significance, interpret A/B test results, or understand when to ship, iterate, or stop a test. This is commonly tested in PM analytics interviews.

# SKILL.md

name: ab-testing
description: "Design, size, and analyze A/B tests for product features. Use this skill when someone needs to design an experiment, calculate sample size, check if a test has reached statistical significance, interpret A/B test results, or understand when to ship, iterate, or stop a test. This is commonly tested in PM analytics interviews."

A/B Testing Framework

A/B testing is how product teams make decisions with data instead of opinions. You help PMs design rigorous tests and interpret results correctly.

Test Design Checklist

Apply this skill to: $ARGUMENTS

Before running any A/B test:

Hypothesis: "By changing [X], we expect [Y] to increase/decrease by [Z%] because [reasoning]."
Primary metric (OEC): The single Overall Evaluation Criterion this test is designed to move.
Secondary metrics: Metrics to monitor for unintended effects or to explain the "why" behind results.
Guardrail metrics: Metrics that must NOT decline (e.g., latency, crash rate, revenue per user).
Sample unit: What is being randomized? (User, session, device, account, geographic cluster)
Population: Which users will be in the test? (All users? New users only? Paid users? Specific segments?)
Traffic split: 50/50 is standard and maximizes statistical power; justify any other split (e.g., risk mitigation with 90/10 for high-risk changes).
Minimum detectable effect (MDE): The smallest improvement worth deploying, informed by business impact analysis.
Required sample size: Calculate before starting using the formula below.
Test duration: Run for at least one full weekly cycle to account for day-of-week effects; consider novelty and primacy effects.
Interaction check: Identify any other concurrent experiments that could interact with this one.
Pre-launch validation: Run A/A test or verify assignment mechanism distributes traffic evenly.

Sample Size Calculation

Frequentist Approach (Standard)

Formula for proportions:
Required N per variant = (16 x baseline_rate x (1 - baseline_rate)) / (MDE)^2

Where MDE is expressed as an absolute difference (not relative).

Example: Baseline conversion = 5%, MDE = 1 percentage point (20% relative lift)
N = (16 x 0.05 x 0.95) / (0.01)^2 = (0.76) / 0.0001 = 7,600 users per variant

Standard parameters: 95% confidence (alpha = 0.05), 80% statistical power (beta = 0.20).

For continuous metrics (e.g., revenue per user):
N per variant = (16 x variance) / (MDE)^2

Adjustments to Sample Size

Multiple variants: Apply Bonferroni correction. For K variants, use alpha/K as the significance threshold, which increases required N.
Multiple metrics: If testing 5 metrics, expect 1 false positive at p < 0.05 by chance. Apply corrections or pre-designate one primary metric.
Cluster randomization: Multiply N by the design effect factor: DE = 1 + (cluster_size - 1) x ICC, where ICC is the intra-cluster correlation.
CUPED variance reduction: If using pre-experiment covariates, required N decreases by roughly 10-50% depending on covariate correlation.

Duration Estimation

Required days = (Required N per variant x number of variants) / daily eligible traffic
Minimum duration: 7 days (one full weekly cycle). Recommended: 14 days for most tests.

Statistical Approaches

Frequentist Testing (Standard)

Compute p-value: probability of observing the result if there were truly no difference.
Statistically significant at p < 0.05 means less than 5% chance this is a false positive.
Report confidence intervals alongside p-values for practical significance assessment.
Important: p-values assume you commit to a sample size in advance and do not peek.

Bayesian Testing

Expresses results as "probability that variant B outperforms control" (e.g., "94% chance B is better").
More intuitive for stakeholders. Does not require fixed sample sizes.
Requires specifying priors (use weakly informative priors based on historical experiment data).
Best for: early-stage products with limited traffic, continuous monitoring needs, or when incorporating prior experiment results.
Decision rule: Ship if P(B > A) exceeds your threshold (e.g., 95%) AND expected lift exceeds minimum worthwhile effect.

Sequential Testing (Always-Valid Inference)

Allows continuous monitoring without inflating false positive rates.
Uses always-valid p-values or confidence sequences that remain valid at any stopping time.
Methods: mixture Sequential Probability Ratio Test (mSPRT), alpha-spending functions.
Best for: high-velocity teams monitoring tests daily, situations where early stopping saves significant cost.
Tradeoff: Slightly less power than fixed-horizon tests for equivalent sample size.

Advanced Techniques

CUPED (Controlled-experiment Using Pre-Experiment Data)

Uses pre-experiment user behavior data to reduce variance in experiment metrics.
Adjusts the metric: Y_adjusted = Y - theta x (X_pre - mean(X_pre)), where X_pre is the pre-experiment covariate.
Typically improves sensitivity by 10-50%, equivalent to running the test with 20-100% more users.
Requirements: A stable pre-experiment metric that correlates with the experiment metric.
Used by Microsoft, Netflix, Airbnb, and most mature experimentation platforms.

Stratified Sampling and Post-Stratification

Divide users into homogeneous strata (e.g., by platform, country, activity level) before or after assignment.
Reduces variance by removing between-stratum differences from the treatment effect estimate.
Complementary to CUPED and can be stacked with it for additional power gains.

Multi-Armed Bandit (Adaptive Allocation)

Dynamically shifts traffic toward better-performing variants during the experiment.
Lower regret (fewer users exposed to inferior variants) compared to fixed 50/50 split.
Best for: short-lived optimizations (headlines, ad copy, UI variants) where speed matters more than precise measurement.
Not suitable for: tests requiring precise effect size estimates or long-term impact measurement.
Methods: Thompson Sampling, Upper Confidence Bound (UCB), Epsilon-Greedy.

Handling Network Effects and Interference

When users influence each other (social products, marketplaces, messaging):
- Cluster-randomized trials: Randomize at the geographic region, team, or marketplace level instead of the individual user.
- Switchback experiments: Alternate treatment and control across time windows (e.g., hours or days). Useful for marketplace algorithms and pricing experiments.
- Interleaving experiments: For ranking/search tests, interleave results from both variants in the same session. Achieves up to 50x faster iteration on search ranking (used by Airbnb, Netflix).
- Always check for spillover effects between treatment and control groups.

Heterogeneous Treatment Effects (HTE)

Analyze whether the treatment effect differs across user segments (new vs. returning, mobile vs. desktop, geography).
Pre-register subgroup analyses to avoid data-dredging.
Methods: Conditional Average Treatment Effect (CATE) estimation using causal forests, S-learner, T-learner.
Useful for personalization decisions: "Ship this change only for segments where it helps."

Common A/B Testing Mistakes

Peeking and early stopping: Checking results daily and stopping when significant inflates false positive rates from 5% to 20-30%. Use sequential testing or commit to a fixed sample size.
Too many variants: Testing 5+ variants simultaneously requires much larger samples due to multiple comparison corrections.
Wrong randomization unit: Using page views when the correct unit is users leads to correlated observations and invalid inference.
Ignoring seasonality: A test running only on weekdays may not generalize to weekends. Run at least 1-2 full weekly cycles.
Underpowered tests: Testing a change affecting 5% of users with a 5% expected lift on a metric with high variance will never reach significance with reasonable traffic.
Ignoring practical significance: A result can be statistically significant but too small to matter. Always assess whether the effect size justifies the implementation cost.
No pre-registration: Deciding hypotheses and metrics after seeing data enables cherry-picking. Document the test plan before launching.
Ignoring novelty and primacy effects: New features may show inflated initial engagement (novelty) or suppressed engagement (resistance to change). Wait for effects to stabilize.
Running concurrent experiments without interaction detection: Overlapping experiments can interact, producing misleading results. Use experiment namespaces or check for interactions.
Confusing correlation with causation in post-hoc analysis: Subgroup findings from a single experiment are hypotheses, not conclusions. Validate with follow-up experiments.

Quasi-Experimental Methods (When A/B Tests Are Not Feasible)

When you cannot randomize (pricing changes, policy changes, partner rollouts):
- Difference-in-Differences (DiD): Compare pre/post changes between treatment and control groups. Requires parallel trends assumption.
- Regression Discontinuity Design (RDD): Exploit sharp eligibility cutoffs (e.g., users above/below a threshold) to estimate causal effects.
- Interrupted Time Series (ITS): Analyze metric trends before and after an intervention when no control group exists.
- Propensity Score Matching: Match treated users with similar untreated users based on observable characteristics.
- Always acknowledge these are weaker than true randomized experiments and flag the assumptions.

Decision Framework

Result	Interpretation	Action
Statistically significant positive on primary, guardrails stable	The change works as intended	Ship it
Statistically significant positive on primary, guardrail degraded	Mixed result with tradeoff	Do not ship. Rethink the approach or accept tradeoff only with explicit stakeholder sign-off
Statistically significant negative on primary	The change hurts	Revert, investigate why, extract learning
Not significant, adequate sample reached	No detectable effect at this MDE	Iterate on hypothesis. Consider: Was the MDE realistic? Was the change impactful enough?
Not significant, insufficient sample	Underpowered test	Run longer, increase traffic, or apply variance reduction (CUPED)
Significant on secondary but not primary	Interesting signal, not actionable alone	Do not ship based on secondary metrics alone. Design follow-up test targeting the secondary finding
HTE analysis shows segment-specific wins	Works for some users, not all	Consider targeted rollout or personalization strategy

Experimentation Maturity and Velocity

Level 1 - Ad Hoc Testing

Individual tests run occasionally. No shared infrastructure.
Action: Standardize hypothesis templates and metric definitions.

Level 2 - Systematic Testing

Regular experimentation with a shared tool (GrowthBook, Statsig, Eppo, LaunchDarkly).
Action: Build a test review process and centralized experiment log.

Level 3 - Scaled Experimentation

Most features launched behind feature flags. Experiment reviews before shipping.
Action: Implement CUPED, sequential testing, and automated experiment diagnostics.

Level 4 - Experimentation Culture

Hundreds of experiments per quarter. Experimentation is the default, not the exception.
Action: Build experiment portfolios, detect interactions, measure experiment velocity as a team metric.

Output Template

A/B TEST DESIGN / ANALYSIS

Test Name: [Name]
Hypothesis: [Full hypothesis statement: By changing X, we expect Y to change by Z% because reasoning]
Statistical Approach: [Frequentist / Bayesian / Sequential]

Design
Primary metric (OEC): [Metric + definition]
Secondary metrics: [List with definitions]
Guardrail metrics: [List with acceptable degradation thresholds]
Sample unit: [User / Session / Account / Cluster]
Population: [Which users, any exclusions]
Traffic split: [Variant A / Variant B % + justification if not 50/50]
Concurrent experiments to check for interaction: [List or "None identified"]

Sample Size Calculation
Baseline rate: [X%]
Minimum detectable effect: [X percentage points (X% relative)]
Confidence level: [95%] | Power: [80%]
Variance reduction method: [None / CUPED / Stratification]
Required N per variant: [Calculated value]
Expected daily eligible traffic: [N users per day]
Required test duration: [N days, minimum 7]

Pre-Launch Checklist
- [ ] Hypothesis and metrics documented and reviewed
- [ ] Sample size calculated and achievable within traffic
- [ ] Randomization unit confirmed
- [ ] Logging and metric collection verified
- [ ] A/A validation or traffic split check planned
- [ ] Interaction with concurrent experiments assessed
- [ ] Success criteria and decision rules agreed upon in advance

Results (if analyzing an existing test)
| Metric | Control | Treatment | Absolute Lift | Relative Lift | p-value / P(B>A) | CI (95%) | Significant? |
|---|---|---|---|---|---|---|---|
| [Primary metric] | X% | Y% | +Z pp | +W% | 0.0X | [L, U] | Yes/No |
| [Secondary metric] | | | | | | | |
| [Guardrail metric] | | | | | | | |

Segment Analysis (if applicable)
| Segment | Control | Treatment | Lift | Significant? |
|---|---|---|---|---|
| [New users] | | | | |
| [Returning users] | | | | |
| [Mobile] | | | | |

Decision
[Ship / Ship to segment / Iterate / Stop + clear reasoning tied to pre-registered criteria]

What We Learned
[2-3 sentences on insights from this test, including implications for future experiments, regardless of outcome]

Follow-Up Actions
[Next experiment to run, features to investigate, or changes to test design]

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.