# knot0-com/vibe-testing

# Install this skill:
npx skills add knot0-com/vibe-testing

Or install a specific skill: npx add-skill https://github.com/knot0-com/vibe-testing


# SKILL.md


---
name: vibe-testing
description: >
  This skill should be used when the user asks to "test my specs",
  "validate my design docs", "find gaps in my architecture", "stress-test
  the spec", "vibe test", "pressure test the docs", or mentions spec
  validation before implementation begins.
version: 1.0.0
---


# Vibe Testing

## Overview

Vibe testing validates specification documents by simulating real-world scenarios against them using LLM reasoning. Instead of writing code or test harnesses, write natural-language scenarios that exercise cross-cutting slices of the spec surface, then trace execution step-by-step, flagging gaps, conflicts, and ambiguities.

**Core principle:** If a realistic user scenario cannot be fully traced through the specs, the specs are incomplete.

**Best used:** After specs are written, before implementation begins.

## When to Use

- Spec docs exist but no implementation yet – validate before building
- After major spec changes – regression-test for new gaps
- Before implementation planning – find blocking gaps early
- Specs span multiple documents – test cross-doc coherence
- Designing for multiple deployment contexts – test each context separately

When NOT to use:

- Single-file specs with obvious scope – just review manually
- Implementation bugs – use actual tests
- API contract validation – use schema validation tools

## Core Method

1. GATHER    – Read all spec docs in the target directory
2. SCENARIOS – Write 3-5 vibe test cases (personas + goals + environments)
3. SIMULATE  – Trace each scenario step-by-step against the specs
4. CLASSIFY  – Tag findings as GAP / CONFLICT / AMBIGUITY
5. SEVERITY  – Rate as BLOCKING / DEGRADED / COSMETIC
6. REPORT    – Produce gap summary + spec coverage matrix

## Writing a Vibe Test Case

Every test case requires 7 sections:

### 1. Persona (WHO)

A concrete person with a name, role, and technical skill level. Not abstract – real enough to predict behavior.

**Sarah** – First-time customer. Shopping on mobile during a commute.
Expects checkout to take under 60 seconds. Low patience for errors.

Named personas force specificity. "A customer" invites hand-waving. "Sarah, shopping on mobile during a commute" forces the spec to answer "what happens on a slow 3G connection?"

### 2. Environment (WHERE)

Deployment mode, hardware, network, access method. Different environments exercise different spec paths.

- **Client:** Mobile browser (iOS Safari, 3G connection)
- **Backend:** Microservices (auth, payments, inventory, orders, notifications)
- **Scale:** Black Friday traffic – 50x normal load

### 3. Goal (WHAT)

A single sentence in the persona's own words. Use a blockquote.

> "I want to buy these 3 items, pay with my credit card, and get a
> confirmation email within a minute."

### 4. Scenario Steps (HOW)

5-8 concrete steps the persona takes. Each step names:

- The user action – what they do
- The primitives exercised – which spec concepts activate
- Gap detection questions – 2-3 questions the simulator must answer

An example step:

#### Step 3: Payment fails, customer retries

Sarah's first payment attempt is declined. She re-enters a different card.

**Primitives:**
- `payments-spec.md`: retry policy, idempotency keys
- `inventory-spec.md`: stock hold duration during retry
- `orders-spec.md`: order state transitions on payment failure

**Questions:**
- Q3.1: The payment spec says "retry 3 times." The inventory spec
  holds stock for 5 minutes. What if retries take longer than 5 minutes?
- Q3.2: Does the order stay in "pending_payment" during retries, or
  does it transition to "failed" and require a new order?

Rules for good steps:

- Each step must cite at least one spec doc
- Each step must ask at least one question the spec should answer
- Questions use the `Q<step>.<number>:` format for traceability
- Questions must be spec-answerable (yes/no/how), not opinion questions
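These rules are mechanical enough to lint before running a simulation. A minimal sketch, assuming test cases are markdown files with `#### Step N:` headers and backticked `*-spec.md` citations as in the example above (the function and file layout are illustrative, not part of the skill):

```python
import re
from pathlib import Path

STEP_HEADER = re.compile(r"^#### Step (\d+):", re.MULTILINE)
SPEC_CITE = re.compile(r"`[\w-]+-spec\.md`")
QUESTION = re.compile(r"\bQ(\d+)\.\d+:")

def lint_vibe_test(path: Path) -> list[str]:
    """Flag steps that cite no spec doc or ask no Q-numbered question."""
    text = path.read_text()
    problems = []
    headers = list(STEP_HEADER.finditer(text))
    for i, header in enumerate(headers):
        step = header.group(1)
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        chunk = text[header.start():end]
        if not SPEC_CITE.search(chunk):
            problems.append(f"Step {step}: cites no spec doc")
        if step not in QUESTION.findall(chunk):
            problems.append(f"Step {step}: has no Q{step}.<n> question")
    return problems
```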

### 5. Spec Coverage Matrix (COVERAGE)

A table showing which spec docs were exercised at which steps.

| Spec Doc | Steps Hit | Coverage |
|----------|-----------|----------|
| `payments-spec.md` | 3,4 | Retry covered; hold-vs-retry timing gap |
| `inventory-spec.md` | 2,3 | Stock hold covered; expiry-during-retry unclear |
| `shipping-spec.md` | – | Not exercised |

Specs that no scenario touches are untested blind spots.

### 6. Gap Detection Questions Summary

Collect all Q-numbers for easy reference. The simulator answers every one.
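For the checkout scenario above, the summary is simply the collected question list, e.g.:

```markdown
## 6. Gap Detection Questions Summary

- Q3.1: Can payment retries outlast the 5-minute inventory stock hold?
- Q3.2: Does the order stay in "pending_payment" during retries?
```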

### 7. Gap Classification (After Simulation)

Classify each finding by severity:

| Severity | Definition | Example |
|----------|------------|---------|
| BLOCKING | Spec cannot answer; implementation impossible | Payment retry duration can exceed inventory hold – no resolution defined |
| DEGRADED | Spec is silent but a workaround exists | No spec for partial refunds on split shipments; can process manually |
| COSMETIC | Missing convenience, not a correctness issue | No order timeline view for customer support |
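Putting the seven sections together, a test case file might follow a skeleton like this (the headings are a suggested layout, not mandated by the skill):

```markdown
# VT-1: First-time mobile checkout

## 1. Persona (WHO)
## 2. Environment (WHERE)
## 3. Goal (WHAT)
## 4. Scenario Steps (HOW)
#### Step 1: ...
#### Step 2: ...
## 5. Spec Coverage Matrix
## 6. Gap Detection Questions Summary
## 7. Gap Classification (filled in after simulation)
```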

## Running a Vibe Test

Use the following as a prompt to a subagent or fresh LLM context with full spec access:

You are a spec validation simulator. You have been given all
specification documents for [system name].

Read the following vibe test case. Simulate executing the scenario
step by step against the specs.

For each step:
1. Identify the governing spec document and section
2. Trace the data flow through the system primitives
3. Answer every Q-numbered question by citing the spec

For each question, classify as:
- COVERED: The spec answers this clearly. Cite the section.
- GAP: The spec is silent. No document addresses this.
- CONFLICT: Two specs give contradictory answers. Cite both.
- AMBIGUITY: The spec addresses this but the answer is unclear.

After all steps, produce:
- Gap summary table (ID, description, severity, affected steps)
- Spec coverage heatmap (which docs exercised, which not)
- Recommended spec changes (which doc to update, what to add)
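In practice this can be driven by a small script that concatenates the spec docs, one test case, and the prompt above into a single context for whatever LLM or agent interface you use. A sketch (the paths and the `call_llm` hand-off are placeholders):

```python
from pathlib import Path

SIMULATOR_PROMPT = Path("references/simulator-prompt.md").read_text()

def build_context(spec_dir: str, test_case: str) -> str:
    """Concatenate every spec doc, the vibe test case, and the simulator prompt."""
    specs = sorted(Path(spec_dir).glob("*.md"))
    parts = [f"## {p.name}\n\n{p.read_text()}" for p in specs]
    parts.append(f"## Vibe test case\n\n{Path(test_case).read_text()}")
    parts.append(SIMULATOR_PROMPT)
    return "\n\n---\n\n".join(parts)

# Hand the assembled context to your agent or LLM client of choice, e.g.:
# report = call_llm(build_context("docs/specs", "vibe-tests/vt-1-checkout.md"))
```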

## Batch Execution

Run all test cases and aggregate:

for each test case:
    1. Load all spec docs as context
    2. Load one test case
    3. Run simulator prompt
    4. Collect gap report

Aggregate:
    - Cross-test gap summary (gaps appearing in multiple tests)
    - Spec coverage union (docs never exercised by any test)
    - Priority ranking (blocking > degraded > cosmetic)
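A sketch of that loop in Python, reusing `build_context` from the previous sketch and assuming the simulator returns a list of gap IDs per test (`run_simulator` stands in for whatever LLM invocation you use):

```python
from collections import Counter
from pathlib import Path

def batch_run(spec_dir: str, test_dir: str, run_simulator) -> dict:
    """Run every vibe test against the same spec set and aggregate the gaps."""
    reports = {}
    for case in sorted(Path(test_dir).glob("*.md")):
        context = build_context(spec_dir, str(case))
        reports[case.stem] = run_simulator(context)  # e.g. ["G-B1", "G-D2"]

    counts = Counter(gap for gaps in reports.values() for gap in gaps)
    cross_test = [gap for gap, n in counts.items() if n > 1]  # gaps hit by multiple tests

    # Priority ranking: blocking > degraded > cosmetic, via the G-B/G-D/G-C prefixes.
    order = {"G-B": 0, "G-D": 1, "G-C": 2}
    ranked = sorted(counts, key=lambda gap: order.get(gap[:3], 3))
    return {"per_test": reports, "cross_test": cross_test, "ranked": ranked}
```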

## Regression Testing

After spec updates, re-run all vibe tests to verify:

  1. Previously identified gaps are now COVERED
  2. No new gaps were introduced
  3. Cross-doc references remain consistent
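Checks 1 and 2 can be done mechanically by diffing the gap IDs from the previous and current batch runs:

```python
def regression_diff(previous: set[str], current: set[str]) -> dict[str, set[str]]:
    """Compare gap IDs between two batch runs of the same vibe tests."""
    return {
        "resolved": previous - current,    # previously identified gaps now COVERED
        "introduced": current - previous,  # new gaps created by the spec change
        "persisting": previous & current,  # still open
    }
```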

## Designing Good Test Cases

### Scenario Selection Strategy

Choose scenarios that vary across dimensions:

| Dimension | Variation A | Variation B | Variation C |
|-----------|-------------|-------------|-------------|
| User type | First-time buyer | Returning customer | Admin/merchant |
| Device | Mobile browser | Desktop | API client |
| Scale | Single user | Normal traffic | Black Friday spike |
| Payment | Happy path | Failure + retry | Partial refund |
| Governance | None (consumer) | Moderate (business) | Strict (compliance) |
| Network | Fast WiFi | Slow 3G | Intermittent |

Each test case should differ from the others on at least 3 dimensions; four test cases covering four distinct quadrants of this space give good coverage.
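Once each test case is tagged with its dimension values, the 3-dimension rule is easy to check mechanically. A sketch (the tags below are illustrative):

```python
from itertools import combinations

cases = {
    "VT-1": {"user": "first-time buyer", "device": "mobile", "scale": "normal", "payment": "failure + retry"},
    "VT-2": {"user": "returning customer", "device": "desktop", "scale": "Black Friday spike", "payment": "happy path"},
    "VT-3": {"user": "admin/merchant", "device": "API client", "scale": "normal", "payment": "partial refund"},
}

for (a, dims_a), (b, dims_b) in combinations(cases.items(), 2):
    differing = [dim for dim in dims_a if dims_a[dim] != dims_b[dim]]
    if len(differing) < 3:
        print(f"{a} vs {b}: only {len(differing)} dimensions differ ({', '.join(differing)})")
```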

### Question Design

Good gap detection questions are:

- Specific: "What order state is set during payment retry?" not "How do orders work?"
- Traceable: Answerable by citing a spec section (or flagging its absence)
- Boundary-probing: Target edges between two specs' responsibilities
- Scale-sensitive: "What happens with 10,000 concurrent checkouts?"
- Failure-aware: "What if the payment fails after inventory is reserved?"

### Coverage Maximization

After writing all test cases, check the coverage union. Every spec doc should appear in at least one coverage matrix. If a doc is never exercised:

- Add a step to an existing test that exercises it, or
- Flag the doc for review – it may be specifying something no real scenario needs
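A sketch of the union check, assuming each test case file contains a coverage matrix whose rows start with a backticked spec doc name followed by the steps hit (the regex and file layout are assumptions about your format):

```python
import re
from pathlib import Path

# A matrix row counts as "exercised" only if its Steps Hit cell contains a digit.
EXERCISED_ROW = re.compile(r"\|\s*`([\w-]+\.md)`\s*\|[^|]*\d")

def coverage_union(test_dir: str) -> set[str]:
    """Spec docs exercised by at least one test case's coverage matrix."""
    covered: set[str] = set()
    for case in Path(test_dir).glob("*.md"):
        covered.update(EXERCISED_ROW.findall(case.read_text()))
    return covered

def untested_docs(spec_dir: str, test_dir: str) -> set[str]:
    """Spec docs no scenario touches -- the untested blind spots."""
    all_specs = {p.name for p in Path(spec_dir).glob("*.md")}
    return all_specs - coverage_union(test_dir)
```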

## Gap Report Format

## Gap Summary

### BLOCKING
| ID | Gap | Affected Tests | Recommended Fix |
|----|-----|---------------|-----------------|
| G-B1 | Payment retry window can exceed inventory hold | VT-1, VT-2 | Align timing in payments-spec.md and inventory-spec.md |

### DEGRADED
| ID | Gap | Affected Tests | Workaround |
|----|-----|---------------|-----------|
| G-D1 | No spec for partial refunds on split shipments | VT-3 | Process refunds per-shipment manually |

### COSMETIC
| ID | Gap | Affected Tests |
|----|-----|---------------|
| G-C1 | No order timeline view for support agents | VT-4 |

Gap IDs use the prefixes G-B (blocking), G-D (degraded), and G-C (cosmetic).

## Common Mistakes

| Mistake | Fix |
|---------|-----|
| Abstract personas ("a user") | Give them names, roles, and constraints |
| Scenario only tests happy path | Add failure steps: "What if the payment is declined?" |
| Questions test opinions ("Is this good?") | Questions must be spec-answerable: "Which doc defines X?" |
| All tests use same user type | Vary across buyer, merchant, admin, support |
| Ignoring coverage matrix | Every spec doc must appear in at least one test |
| Writing tests after implementation | Vibe tests validate specs BEFORE implementation |
| Too many steps per scenario | Keep to 5-8 steps; focused scenarios find more gaps |

## Additional Resources

- `references/simulator-prompt.md` – Full simulator prompt template ready to paste
- `examples/example-vibe-test.md` – Complete example vibe test case

# README.md

# Vibe Testing

Pressure-test your specs with LLM reasoning before writing code.

Vibe testing is a technique for validating specification documents by simulating real-world scenarios against them. An LLM reads your spec docs, traces through a concrete user scenario step by step, and flags every gap, conflict, and ambiguity – before anyone writes a line of implementation.

## Why

We test code obsessively. Unit tests, integration tests, E2E tests. But specifications? We "review" them in a meeting.

Vibe testing moves the discovery of design flaws to the cheapest possible moment: before implementation begins.

## How It Works

1. Write a scenario: a named persona, a concrete goal, step-by-step interaction
2. Give an LLM all your spec docs + the scenario
3. The LLM traces each step, identifies governing specs, flags gaps
4. You get a structured gap report with severity ratings

No code. No test harness. Just reasoning.

## Install as Agent Skill

Works with Claude Code, OpenAI Codex, Gemini CLI, Cursor, GitHub Copilot, OpenCode, and any tool supporting the Agent Skills open standard.

### Claude Code

git clone https://github.com/knot0-com/vibe-testing.git ~/.claude/skills/vibe-testing

### OpenAI Codex

git clone https://github.com/knot0-com/vibe-testing.git ~/.codex/skills/vibe-testing

### Gemini CLI

git clone https://github.com/knot0-com/vibe-testing.git ~/.gemini/skills/vibe-testing

### Universal (works with most agents)

git clone https://github.com/knot0-com/vibe-testing.git ~/.agent/skills/vibe-testing

### Project-level (shared with team)

git clone https://github.com/knot0-com/vibe-testing.git .claude/skills/vibe-testing

## Usage

Once installed, the skill activates when you ask your coding agent to validate specs:

> /vibe-testing

> "Test my specs against a realistic scenario"

> "Find gaps in the architecture docs before we start building"

> "Vibe test the design docs in docs/v2/"

## What's Included

vibe-testing/
├── SKILL.md                          # The skill definition (Agent Skills standard)
├── references/
│   └── simulator-prompt.md           # Copy-paste prompt templates
└── examples/
    └── example-vibe-test.md          # Complete example: e-commerce checkout flow

## The Gap Report

Vibe tests produce a structured gap report:

| Severity | Meaning |
|----------|---------|
| BLOCKING | Spec cannot answer. Implementation impossible without resolution. |
| DEGRADED | Workaround exists but it's fragile. |
| COSMETIC | Missing convenience. Not a correctness issue. |

## Example

The included example tests an e-commerce checkout against specs for auth, payments, inventory, orders, notifications, and shipping. A single scenario – "first-time buyer, payment declined, retries with new card" – found:

- Payment retry timing exceeds inventory hold – stock can be sold to another customer while the buyer is entering a new card number
- Auth token expires mid-checkout – 15-minute JWT TTL vs. potentially longer checkout flow on slow connections
- Payment succeeds but order confirmation fails – customer is charged with no order record (no saga/compensation defined)
- Guest checkout order access undefined – no spec for how a guest views their order status

Each would have been a rewrite-level discovery weeks into implementation.

## License

MIT

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
