Install this specific skill from the multi-skill repository:

```bash
npx skills add halay08/fullstack-agent-skills --skill "ab-test-setup"
```
# Description
Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
# SKILL.md

---
name: ab-test-setup
description: Structured guide for setting up A/B tests with mandatory gates for hypothesis, metrics, and execution readiness.
---
# A/B Test Setup
## 1️⃣ Purpose & Scope
Ensure every A/B test is valid, rigorous, and safe before a single line of code is written.
- Prevents "peeking"
- Enforces statistical power
- Blocks invalid hypotheses
## 2️⃣ Prerequisites
You must have:
- A clear user problem
- Access to an analytics source
- Roughly estimated traffic volume
### Hypothesis Quality Checklist
A valid hypothesis includes:
- Observation or evidence
- Single, specific change
- Directional expectation
- Defined audience
- Measurable success criteria
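A hypothesis that passes this checklist maps naturally onto a small structured record. A minimal sketch in Python (the field names and example values are illustrative, not part of any standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: once written, the hypothesis does not change
class Hypothesis:
    evidence: str        # observation that motivated the test
    change: str          # single, specific change under test
    direction: str       # expected direction of effect
    audience: str        # defined target audience
    success_metric: str  # measurable success criterion

hypothesis = Hypothesis(
    evidence="40% of mobile users abandon checkout at the payment step",
    change="Replace the three-step checkout with a one-page checkout",
    direction="increase",
    audience="mobile users, new sessions only",
    success_metric="checkout_conversion_rate",
)
```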
## 3️⃣ Hypothesis Lock (Hard Gate)
Before designing variants or metrics, you MUST:
- Present the final hypothesis
- Specify:
- Target audience
- Primary metric
- Expected direction of effect
- Minimum Detectable Effect (MDE)
Ask explicitly:
“Is this the final hypothesis we are committing to for this test?”
Do NOT proceed until confirmed.
## 4️⃣ Assumptions & Validity Check (Mandatory)
Explicitly list assumptions about:
- Traffic stability
- User independence
- Metric reliability
- Randomization quality (see the bucketing sketch below)
- External factors (seasonality, campaigns, releases)
If assumptions are weak or violated:
- Warn the user
- Recommend delaying or redesigning the test
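One of these assumptions, randomization quality, can be spot-checked directly in code. A common approach (a sketch, not something this skill prescribes) is deterministic hash-based bucketing, which keeps each user in the same variant across sessions:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministically bucket a user: same user + experiment -> same variant."""
    # Salt with the experiment name so assignments are independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Stable across calls, so repeat sessions see a consistent experience.
assert assign_variant("user-123", "checkout-v2") == assign_variant("user-123", "checkout-v2")
```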
## 5️⃣ Test Type Selection
Choose the simplest valid test:
- A/B Test – single change, two variants
- A/B/n Test – multiple variants, higher traffic required
- Multivariate Test (MVT) – interaction effects, very high traffic
- Split URL Test – major structural changes
Default to A/B unless there is a clear reason otherwise.
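The traffic cost of moving beyond plain A/B is concrete: extra variants split traffic and, under a multiple-comparison correction, tighten the per-comparison significance threshold. A rough sketch using a simple Bonferroni correction (one common choice, not the only one):

```python
def per_comparison_alpha(alpha: float, n_variants: int) -> float:
    """Bonferroni-adjusted alpha for an A/B/n test with one control.

    Each non-control variant is compared against control, so the
    per-comparison significance level shrinks as variants are added.
    """
    return alpha / (n_variants - 1)

print(per_comparison_alpha(0.05, 2))  # plain A/B: 0.05
print(per_comparison_alpha(0.05, 4))  # A/B/C/D: ~0.0167 -> much more traffic needed
```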
## 6️⃣ Metrics Definition
### Primary Metric (Mandatory)
- Single metric used to evaluate success
- Directly tied to the hypothesis
- Pre-defined and frozen before launch
### Secondary Metrics
- Provide context
- Explain why results occurred
- Must not override the primary metric
### Guardrail Metrics
- Metrics that must not degrade
- Used to prevent harmful wins
- Trigger test stop if significantly negative
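One way to make this three-way split explicit, and to freeze it before launch, is to commit it as configuration. A hypothetical sketch (the metric names are examples, not a prescribed schema):

```python
METRICS = {
    # Exactly one primary metric, tied directly to the hypothesis, frozen at launch.
    "primary": "checkout_conversion_rate",
    # Context only; these explain results but never override the primary metric.
    "secondary": ["add_to_cart_rate", "time_to_purchase_sec"],
    # Must not degrade; a significant regression stops the test.
    "guardrails": {
        "error_rate": "must_not_increase",
        "page_load_p95_ms": "must_not_increase",
        "refund_rate": "must_not_increase",
    },
}
```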
## 7️⃣ Sample Size & Duration
Define upfront:
- Baseline rate
- MDE
- Significance level (typically 5%, i.e. 95% confidence)
- Statistical power (typically 80%)
Estimate:
- Required sample size per variant
- Expected test duration
Do NOT proceed without a realistic sample size estimate.
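Those four inputs are enough to estimate both numbers. A sketch using statsmodels for a conversion-rate (two-proportion) test; the baseline, MDE, and traffic figures are placeholders:

```python
from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10  # current conversion rate
mde = 0.02       # minimum detectable effect, absolute (10% -> 12%)
alpha = 0.05     # significance level (95% confidence)
power = 0.80     # statistical power

# Cohen's h effect size for the two proportions.
effect_size = proportion_effectsize(baseline + mde, baseline)

n_per_variant = ceil(NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided",
))

daily_eligible_users = 500  # rough traffic estimate
duration_days = ceil(2 * n_per_variant / daily_eligible_users)

print(f"{n_per_variant} users per variant, ~{duration_days} days")
# With these placeholders: roughly 1,900 users per variant, about 8 days.
```

Where possible, round the duration up to whole weeks so day-of-week effects average out.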
## 8️⃣ Execution Readiness Gate (Hard Stop)
You may proceed to implementation only if all are true:
- Hypothesis is locked
- Primary metric is frozen
- Sample size is calculated
- Test duration is defined
- Guardrails are set
- Tracking is verified
If any item is missing, stop and resolve it.
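The gate can be enforced mechanically rather than by memory. A minimal sketch (the checklist keys are illustrative):

```python
readiness = {
    "hypothesis_locked": True,
    "primary_metric_frozen": True,
    "sample_size_calculated": True,
    "duration_defined": True,
    "guardrails_set": True,
    "tracking_verified": False,  # e.g. variant-exposure events not yet visible in analytics
}

missing = [item for item, done in readiness.items() if not done]
if missing:
    raise RuntimeError(f"Execution readiness gate failed: {missing}")
```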
## Running the Test
### During the Test
DO:
- Monitor technical health (e.g., sample ratio checks; see the sketch below)
- Document external factors
DO NOT:
- Stop early due to “good-looking” results
- Change variants mid-test
- Add new traffic sources
- Redefine success criteria
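"Monitor technical health" includes verifying that traffic actually splits as designed. A sample ratio mismatch (SRM) check, sketched here as a chi-square goodness-of-fit test with placeholder counts, is one standard way to catch broken randomization or tracking:

```python
from scipy.stats import chisquare

observed = [50_640, 49_360]  # users actually assigned to control / treatment
expected = [50_000, 50_000]  # what a 50/50 split should produce

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:  # a conventionally strict threshold for SRM alerts
    print("Sample ratio mismatch: assignment or tracking is broken; pause the test.")
```

Note this monitors the plumbing, not the primary metric; checking the metric itself mid-test is exactly the peeking this skill forbids.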
## Analyzing Results
### Analysis Discipline
When interpreting results:
- Do NOT generalize beyond the tested population
- Do NOT claim causality beyond the tested change
- Do NOT override guardrail failures
- Separate statistical significance from business judgment
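Separating significance from judgment starts with computing significance on the frozen primary metric only. A sketch using a two-proportion z-test from statsmodels (the counts are placeholders):

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [1_180, 1_050]  # successes: treatment, control
exposures = [10_000, 10_000]  # users per variant

stat, p_value = proportions_ztest(count=conversions, nobs=exposures,
                                  alternative="two-sided")
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# p answers "is the difference likely real?"; whether a +1.3pp lift
# justifies rollout cost and risk is a separate business judgment.
```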
### Interpretation Outcomes
| Result | Action |
|---|---|
| Significant positive | Consider rollout |
| Significant negative | Reject variant, document learning |
| Inconclusive | Consider more traffic or bolder change |
| Guardrail failure | Do not ship, even if primary wins |
## Documentation & Learning
### Test Record (Mandatory)
Document:
- Hypothesis
- Variants
- Metrics
- Planned vs. achieved sample size
- Results
- Decision
- Learnings
- Follow-up ideas
Store records in a shared, searchable location to avoid repeated failures.
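A sketch of such a record (the fields mirror the list above; the format and values are illustrative):

```python
test_record = {
    "hypothesis": "One-page checkout increases mobile conversion by >= 2pp",
    "variants": {"control": "3-step checkout", "treatment": "1-page checkout"},
    "metrics": {"primary": "checkout_conversion_rate",
                "guardrails": ["error_rate", "refund_rate"]},
    "sample_size": {"planned_per_variant": 1918, "achieved_per_variant": 1950},
    "results": "treatment +1.9pp, p = 0.003; no guardrail regressions",
    "decision": "roll out to 100% of mobile traffic",
    "learnings": "most of the lift came from removing address re-entry",
    "follow_up": ["test one-page checkout on desktop"],
}
```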
## Refusal Conditions (Safety)
Refuse to proceed if:
- Baseline rate is unknown and cannot be estimated
- Traffic is insufficient to detect the MDE
- Primary metric is undefined
- Multiple variables are changed without proper design
- Hypothesis cannot be clearly stated
Explain why and recommend next steps.
## Key Principles (Non-Negotiable)
- One hypothesis per test
- One primary metric
- Commit before launch
- No peeking
- Learning over winning
- Statistical rigor first
## Final Reminder
A/B testing is not about proving ideas right.
It is about learning the truth with confidence.
If you feel tempted to rush, simplify, or “just try it” —
that is the signal to slow down and re-check the design.
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.