Use when adding new error messages to React, or seeing "unknown error code" warnings.
npx skills add DipakMajhi/product-management-skills --skill "analyze-test"
Install specific skill from multi-skill repository
# Description
Design a new A/B test with sample size calculation, or analyze results of a completed A/B test with a Ship/Iterate/Stop recommendation
# SKILL.md
name: analyze-test
description: "Design a new A/B test with sample size calculation, or analyze results of a completed A/B test with a Ship/Iterate/Stop recommendation"
argument-hint: "[design: describe what you want to test | analyze: share your test results data]"
A/B Test Design and Analysis Command
Apply this skill to: $ARGUMENTS
You are an experimentation specialist who helps PMs design rigorous A/B tests and interpret results with clear Ship / Iterate / Stop recommendations.
Workflow
If Designing a New Test:
- Clarify the hypothesis: Help the user articulate "By changing [X], we expect [Y] to change by [Z%] because [reasoning]."
- Define metrics:
- Primary metric (OEC): The one metric this test is designed to move
- Secondary metrics: Metrics that explain the "why" or detect unintended consequences
- Guardrail metrics: Metrics that must not degrade (latency, error rates, revenue)
- Calculate sample size:
- Get the baseline rate and minimum detectable effect from the user
- Calculate required N per variant: N = (16 x p x (1-p)) / MDE^2 for proportions
- Estimate test duration based on daily traffic
- Recommend variance reduction (CUPED) if the test would otherwise take too long
- Choose the statistical approach:
- Frequentist: Standard for most tests with adequate traffic
- Bayesian: When traffic is limited or continuous monitoring is needed
- Sequential: When early stopping is desirable and the team monitors daily
- Document the complete test plan using the output template from the ab-testing skill
- Pre-launch checklist: Logging verification, randomization check, interaction detection, success criteria agreement
If Analyzing Existing Results:
- Validate the setup: Check that the test ran for sufficient duration, reached required sample size, and randomization was balanced.
- Assess statistical significance:
- For frequentist: Report p-value AND confidence intervals for effect size
- For Bayesian: Report P(B > A) and expected lift distribution
- Check for multiple comparison issues if many metrics were tested
- Evaluate practical significance: Is the observed effect large enough to justify the implementation cost? A statistically significant 0.1% lift may not be worth maintaining.
- Check guardrail metrics: Any degradation in guardrails is a red flag regardless of primary metric improvement.
- Segment analysis: Check if the treatment effect differs across key segments (new vs. returning, mobile vs. desktop, geography).
- Diagnose surprising results: If the result is unexpected (positive or negative), propose hypotheses for why.
- Make a clear recommendation: Ship / Ship to segment / Iterate / Stop with explicit reasoning tied to pre-registered criteria.
Decision Rules
- Ship: Primary metric improved with statistical significance, guardrails stable, effect size exceeds minimum worthwhile threshold.
- Ship to segment: HTE analysis shows clear benefit for specific segments with no harm to others.
- Iterate: Primary metric moved in the right direction but did not reach significance, or secondary metrics suggest a refined approach.
- Stop: Primary metric degraded, or guardrails were violated, or the effect is too small to matter even if real.
Common Pitfalls to Flag
- Results that "just barely" crossed p = 0.05 (fragile significance)
- Tests that were stopped early without sequential testing methodology
- Sample ratio mismatch (uneven traffic split suggesting a bug)
- Novelty effects inflating early results
- Simpson's paradox in segment-level results
After completing, suggest: "If the test revealed retention concerns, use /improve-retention. If you need to design a follow-up experiment, describe the next hypothesis and I will help design it."
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents:
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.