# Story Refiner

by bobchao
# Install this skill:
npx skills add bobchao/pm-skills-rfp-to-stories --skill "Story Refiner"

Installs this specific skill from the multi-skill repository.

# Description

Evaluates User Story quality and automatically corrects items not meeting standards. Reviews from developer, QA, and stakeholder perspectives, directly producing improved versions for low-quality Stories, reducing manual intervention.

# SKILL.md


name: "Story Refiner"
description: "Evaluates User Story quality and automatically corrects items not meeting standards. Reviews from developer, QA, and stakeholder perspectives, directly producing improved versions for low-quality Stories, reducing manual intervention."


Story Refiner Skill

Language Preference

Default: Respond in the same language as the user's input or as explicitly requested by the user.

If the user specifies a preferred language (e.g., "請用中文回答", "Reply in Japanese"), use that language for all outputs. Otherwise, match the language of the provided Stories.


Role Definition

You simultaneously play three roles to review User Stories:

  1. Senior Developer: Evaluates technical feasibility and estimation clarity
  2. QA Engineer: Evaluates testability and acceptance criteria clarity
  3. Product Stakeholder: Evaluates requirement coverage and value clarity

Core Principles

Correction Over Reporting

  • Don't just point out problems; fix them directly
  • Every flagged issue must have a corresponding improved version
  • Humans only need final confirmation, not manual correction

Conservative Correction

  • Only correct Stories with "obvious problems"
  • Don't correct for the sake of correcting
  • Stories that already pass don't need changes

Transparent Annotation

  • Clearly explain why corrections were made
  • Provide original vs. improved version comparison
  • Let humans choose to accept the improvement or keep the original version

Input Format

This Skill accepts the following inputs:

  1. Story Writer output (recommended)
  2. A list of User Stories in any format
  3. Original RFP + Stories (enables cross-referencing requirement coverage)

Evaluation Criteria Reference

All scoring and evaluation must follow the standards defined in references/evaluation-criteria.md.

This document defines:
- Three scoring dimensions (Development Clarity, Testability, Value Clarity)
- Detailed scoring criteria for each dimension (1-5 points)
- Specific checkpoints and common deduction patterns
- Final score calculation method

Important: Both Quick Scan (Phase 1) and Detailed Evaluation (Phase 2) use these same criteria, with different levels of depth.


Evaluation Flow

Phase 1: Quick Scan

Score each Story initially (1-5 points) using the three dimensions from references/evaluation-criteria.md:

Scoring Method:
1. Quickly assess each dimension (Development Clarity, Testability, Value Clarity) on a 1-5 scale
2. Calculate final score: round((Development Clarity + Testability + Value Clarity) / 3)
3. Use the scoring criteria tables in references/evaluation-criteria.md as reference

Quick Assessment Focus:
- Development Clarity: Is action specific? Scope clear? Dependencies clear?
- Testability: Can write test cases? Acceptance criteria present? Value verifiable?
- Value Clarity: Value clear? Role correct? Maps to requirements?

| Score | Level | Action |
|-------|-------|--------|
| 5 | Excellent | Keep, no modification |
| 4 | Good | Keep, may have minor suggestions |
| 3 | Passing | Mark for observation, may need minor adjustments |
| 2 | Insufficient | Must correct |
| 1 | Severely insufficient | Must rewrite |

Only Stories scoring ≤ 3 enter Phase 2 detailed evaluation.
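
To make the arithmetic concrete, here is a minimal sketch of the Phase 1 scoring and gating logic (Python, with hypothetical helper names; the per-dimension scores themselves come from references/evaluation-criteria.md):

```python
from statistics import mean

# Sketch only: the dimension scores (1-5) are assumed to come from
# applying references/evaluation-criteria.md to a single Story.
def quick_scan_score(dev_clarity: int, testability: int, value_clarity: int) -> int:
    """Final score = round((Development Clarity + Testability + Value Clarity) / 3)."""
    return round(mean([dev_clarity, testability, value_clarity]))

def needs_detailed_review(final_score: int) -> bool:
    """Only Stories scoring <= 3 enter Phase 2 detailed evaluation."""
    return final_score <= 3

# Example: a Story rated 4 / 3 / 2 averages 3.0 -> final score 3 -> detailed review.
assert needs_detailed_review(quick_scan_score(4, 3, 2))
```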

Phase 2: Multi-Perspective Detailed Evaluation

For Stories needing review, perform detailed evaluation from three perspectives using the Specific Checkpoints and Common Deduction Patterns defined in references/evaluation-criteria.md.

👨‍💻 Developer Perspective

Reference: references/evaluation-criteria.md - Dimension 1: Development Clarity

Detailed Checkpoints (from evaluation-criteria.md):
- [ ] Is action description specific?
  - 5 points: "Upload JPG/PNG format images, limited to 5MB"
  - 3 points: "Upload images"
  - 1 point: "Handle images"
- [ ] Does scope have boundaries?
  - 5 points: "Edit article title and content"
  - 3 points: "Edit article"
  - 1 point: "Manage articles"
- [ ] Are dependencies clear?
  - 5 points: Clearly marked "requires US-001 login feature completed first"
  - 3 points: Implied dependency but not marked
  - 1 point: Confusing or circular dependencies

Common Problems (see evaluation-criteria.md for deduction patterns):
- Vague verbs: "manage", "handle", "maintain" (-1~2 points)
- No scope boundary: "all settings", "various reports" (-1~2 points)
- Compound features: "create and edit" (-1 point)
- Technical details mixed in: "load using AJAX" (-1 point)

🧪 QA Perspective

Reference: references/evaluation-criteria.md - Dimension 2: Testability

Detailed Checkpoints (from evaluation-criteria.md):
- [ ] Are acceptance criteria clear?
  - 5 points: Has specific Given-When-Then or checklist
  - 3 points: Has general direction but not specific
  - 1 point: No acceptance criteria, or vague like "should be user-friendly"
- [ ] Is value verifiable?
  - 5 points: "so that I can find target article within 3 seconds" (measurable)
  - 3 points: "so that I can find articles faster" (relative but comparable)
  - 1 point: "so that I can have a better experience" (not measurable)
- [ ] Are error scenarios considered?
  - 5 points: Clearly states error handling
  - 3 points: Only happy path, but error handling can be inferred
  - 1 point: Error scenarios not considered at all, even though they matter for this feature

Common Problems (see evaluation-criteria.md for deduction patterns):
- No acceptance criteria: None at all (-1~2 points, important features deduct more)
- Vague criteria: "should be fast", "should look good" (-1 point)
- Untestable value: "so that I can have better experience" (-2 points)

👀 Stakeholder Perspective

Reference: references/evaluation-criteria.md - Dimension 3: Value Clarity

Detailed Checkpoints (from evaluation-criteria.md):
- [ ] Does "so that..." state real value?
- 5 points: "so that I can pull up data within 10 seconds when customer calls"
- 3 points: "so that I can quickly view data"
- 1 point: "so that I can use this feature" (circular reasoning)
- [ ] Is role correct?
- 5 points: Role is clear and is the true beneficiary of this feature
- 3 points: Role too generic (e.g., "user" covers too much)
- 1 point: Wrong role (e.g., giving admin feature to regular user)
- [ ] Maps to original requirements?
- 5 points: Can directly trace to a specific RFP paragraph
- 3 points: Is reasonably derived implied requirement
- 1 point: Can't see connection to original requirements

Common Problems (see evaluation-criteria.md for deduction patterns):
- Circular reasoning: "so that I can use this feature" (-2 points)
- Role too generic: Everything is "user" (-1 point)
- Technical task disguised: "As a developer" (-3 points)
- Deviates from original requirements: Features RFP didn't mention (-1~2 points)

Phase 3: Auto-Correction

For Stories scoring ≤ 3, execute corrections based on problem type:

Correction Strategies

| Problem Type | Correction Method |
|--------------|-------------------|
| Scope too large | Split into multiple Stories |
| Scope vague | Add specific operation description |
| Value unclear | Rewrite "so that..." part |
| Not testable | Add specific acceptance criteria |
| Format issue | Adjust to standard format |
| Wrong role | Correct to proper role |
| Improper granularity | Split or merge |
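
As a compact illustration, the same mapping could be sketched as a lookup (the problem-type labels are illustrative, not a fixed taxonomy defined by this skill):

```python
# Sketch: problem type -> correction strategy, mirroring the table above.
# Labels are illustrative only; the skill describes these strategies in prose.
CORRECTION_STRATEGIES = {
    "scope_too_large":      "split into multiple Stories",
    "scope_vague":          "add a specific operation description",
    "value_unclear":        "rewrite the 'so that...' part",
    "not_testable":         "add specific acceptance criteria",
    "format_issue":         "adjust to the standard format",
    "wrong_role":           "correct to the proper role",
    "improper_granularity": "split or merge",
}

def pick_strategy(problem_type: str) -> str:
    # Unknown problem types fall back to manual review rather than guessing.
    return CORRECTION_STRATEGIES.get(problem_type, "flag for manual review")
```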

Correction Principles

  1. Minimum change: If small change works, don't make big changes
  2. Preserve intent: Don't change original requirement intent
  3. Clear annotation: Explain what was changed and why

Phase 4: Iterative Validation (Max 3 Rounds)

Corrected Stories need re-evaluation to ensure quality meets standards. This is the core of iterative refinement.

Why Iteration Is Needed

| Situation | Single-Pass Refinement Problem | Iterative Solution |
|-----------|-------------------------------|--------------------|
| Story is split | New Stories aren't evaluated | ✅ Next round evaluates new Stories |
| Over-correction | Might break something | ✅ Next round catches and fine-tunes |
| Acceptance criteria still not specific | Passes through | ✅ Next round strengthens |

Iteration Flow

Round 1: Evaluate all Stories → Correct low-scoring items → Produce corrected version
    ↓
Round 2: Evaluate "corrected" + "newly generated" Stories → Correct again if needed
    ↓
Round 3: (If still issues) Final fine-tuning
    ↓
Terminate: Output final version

Termination Conditions (Stop when any is met)

  1. Quality achieved: All Stories score ≥ 4
  2. No corrections needed: This round had no Story corrections
  3. Limit reached: Already executed 3 rounds
  4. Convergence failed: Same Story corrected 2 rounds in a row but score didn't improve
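
A sketch of how these termination conditions might be expressed in code (hypothetical data shapes and names; the skill applies the same rules in prose):

```python
# Sketch: termination check for the iterative refinement loop.
# `scores` maps Story IDs to their current 1-5 score;
# `corrections_this_round` counts corrections made in the round just finished;
# `stalled` is True when a Story was corrected two rounds in a row
# without its score improving. All names are illustrative.
def should_terminate(scores: dict[str, int],
                     corrections_this_round: int,
                     round_number: int,
                     stalled: bool,
                     max_rounds: int = 3,
                     pass_threshold: int = 4) -> bool:
    if all(s >= pass_threshold for s in scores.values()):   # 1. quality achieved
        return True
    if corrections_this_round == 0:                          # 2. no corrections needed
        return True
    if round_number >= max_rounds:                           # 3. round limit reached
        return True
    if stalled:                                               # 4. convergence failed
        return True
    return False
```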

Iteration Rules

| Rule | Description |
|------|-------------|
| Progressive convergence | Each round should reduce problems, not increase them |
| History memory | Track each Story's correction history, avoid back-and-forth changes |
| Correction limit | Same Story can only be majorly changed once, then only fine-tuned |
| New Story priority | From round 2, prioritize evaluating Stories generated in previous round |

Decreasing Correction Intensity

| Round | Allowed Correction Types |
|-------|--------------------------|
| Round 1 | All corrections (split, rewrite, add acceptance criteria, etc.) |
| Round 2 | Moderate corrections (add acceptance criteria, adjust wording, minor splits) |
| Round 3 | Fine-tuning only (word corrections, add details, no splitting or rewriting) |

This design ensures:
- Round 1 solves structural problems
- Round 2 handles omissions and fine-tuning
- Round 3 is just wrap-up, avoiding infinite modification
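
Expressed as data, the intensity schedule above might look like this (a sketch; the correction-type names are illustrative):

```python
# Sketch: allowed correction types per refinement round,
# mirroring the "Decreasing Correction Intensity" table above.
ALLOWED_CORRECTIONS = {
    1: {"split", "rewrite", "add_acceptance_criteria", "adjust_wording", "fine_tune"},
    2: {"add_acceptance_criteria", "adjust_wording", "minor_split", "fine_tune"},
    3: {"fine_tune"},  # wrap-up only: word fixes and added details
}

def is_allowed(round_number: int, correction_type: str) -> bool:
    return correction_type in ALLOWED_CORRECTIONS.get(round_number, set())
```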

Iteration Summary Output

Record at end of each round:

### Round N Refinement Summary

| Metric | Value |
|--------|-------|
| Stories Evaluated | XX |
| Corrections Made | XX |
| New (from splits) | XX |
| Average Score Improvement | +X.X |

**This Round's Corrections**:
- US-XXX: [Correction summary]
- US-XXX: [Correction summary]

**Continue?**: [Yes/No, reason]

Output Format

Structure Overview

# Story Refinement Report

## 📊 Refinement Summary

### Overall Results
- Original Story Count: XX
- Final Story Count: XX (including split additions)
- Refinement Rounds: X / 3
- Termination Reason: [Quality achieved / No corrections needed / Limit reached]

### Per-Round Statistics
| Round | Evaluated | Corrected | Added | Average Score |
|-------|-----------|-----------|-------|---------------|
| Round 1 | XX | XX | XX | X.X |
| Round 2 | XX | XX | XX | X.X |
| ... | ... | ... | ... | ... |

## 🔄 Refinement History
[Per-round correction summaries, collapsible]

## ✅ Final Passing Stories
[Stories scoring ≥ 4]

## 🔧 Corrected Stories
[Original → Final version comparison, noting correction round]

## ➕ Split-Generated Stories
[New Stories from splits]

## 🗑️ Recommended for Removal
[Stories not matching requirements or duplicates]

## 📋 Final Story List
[Complete integrated list, ready for use]

Correction Detail Format

### 🔧 US-XXX: [Title]

**Original Version**:
> As a [role], I want [action], so that [value].

**Problem Diagnosis**:
- 🧪 QA Perspective: Acceptance criteria unclear, can't write tests
- 👨‍💻 Developer Perspective: Scope includes multiple independent features

**Correction Method**: Split into two Stories + add acceptance criteria

**Improved Version**:

**US-XXX-A**: As a [role], I want [action A], so that [value].
- Acceptance Criteria:
  - [ ] Condition 1
  - [ ] Condition 2

**US-XXX-B**: As a [role], I want [action B], so that [value].
- Acceptance Criteria:
  - [ ] Condition 1

---

Special Situation Handling

Situation 1: Large Number of Stories Need Correction (>50%)

This may indicate systematic issues in the Story Writer phase:

  1. Don't correct one by one (too inefficient)
  2. Identify common problem patterns
  3. Propose systematic suggestions
  4. Recommend re-running Story Writer

Situation 2: Discovered Missing Features

If comparing against the RFP reveals features not covered by any Story:

  1. Mark as "recommended addition"
  2. Produce suggested Story
  3. Mark source (derived from which part of RFP)

Situation 3: Discovered Duplicate Stories

  1. Mark duplicate items
  2. Recommend which to keep (or merge)
  3. Explain judgment basis

Situation 4: Story Quality Is Excellent

If all Stories score ≥ 4:

  1. Briefly confirm "Quality is good, no corrections needed"
  2. Can provide minor optimization suggestions (not mandatory)
  3. Directly output final list

Output Example

Refer to assets/refine-example.md for complete output example.


Reference Documents

  • Evaluation Criteria: references/evaluation-criteria.md - Defines detailed scoring standards for all three dimensions
  • Output Example: assets/refine-example.md - Complete refinement report example

Integration with Other Skills

Standard Flow

[rfp-analyzer] → [story-writer] → [story-refiner] → Final output

Usage: After Story Writer produces the User Stories draft, use Story Refiner to evaluate their quality and automatically correct low-scoring Stories. This is a separate step that should be invoked explicitly when refinement is needed.


Quality Threshold Settings

Default Threshold

  • Pass threshold: ≥ 4 points
  • Must correct: ≤ 2 points
  • Observation zone: 3 points (optional correction)

Strict Mode

When the user requests a "strict check" or the project risk is higher:

  • Pass threshold: 5 points
  • Must correct: ≤ 3 points
  • All Stories must have acceptance criteria

Lenient Mode

When the user requests a "quick pass" or the project is an MVP/POC:

  • Pass threshold: ≥ 3 points
  • Only correct severe issues scoring ≤ 1 point
  • Acceptance criteria optional
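
The three threshold profiles could also be summarized as configuration, for example (a sketch; field names are illustrative, not part of the skill's interface):

```python
from dataclasses import dataclass

# Sketch: the three quality-threshold profiles described above.
@dataclass(frozen=True)
class ThresholdProfile:
    pass_threshold: int              # score at or above which a Story passes
    must_correct_at_or_below: int    # score at or below which correction is mandatory
    acceptance_criteria_required: bool

PROFILES = {
    # Default mode does not state an acceptance-criteria rule; False is an assumption here.
    "default": ThresholdProfile(pass_threshold=4, must_correct_at_or_below=2,
                                acceptance_criteria_required=False),
    "strict":  ThresholdProfile(pass_threshold=5, must_correct_at_or_below=3,
                                acceptance_criteria_required=True),
    "lenient": ThresholdProfile(pass_threshold=3, must_correct_at_or_below=1,
                                acceptance_criteria_required=False),
}
```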

Checklist

After completing refinement, confirm the following items:

  • [ ] All Stories scoring ≤ 2 points have been corrected or rewritten
  • [ ] Corrected Stories meet INVEST principles
  • [ ] Split-generated new Stories have proper numbering
  • [ ] Final list has no duplicates
  • [ ] All original requirement coverage preserved
  • [ ] Clear annotation of which are original vs. improved versions
  • [ ] Termination reason is reasonable (not forced stop from reaching limit)
  • [ ] No Story was changed back-and-forth across multiple rounds

Iterative vs. Single-Pass Refinement

When to Use Iterative (Default)

  • Formal projects
  • Story count > 10
  • Has split operations
  • Higher quality requirements

When to Use Single-Pass

When the user explicitly says "quick refine" or "one pass only":

  • MVP/POC projects
  • Time pressure
  • Story count < 10
  • General quality requirements

Why the 3-Round Limit

  1. Rule of thumb: Most problems are resolved within 2 rounds
  2. Diminishing returns: Round 3+ corrections are usually nitpicking
  3. Avoid over-engineering: Infinite refinement may drift from original requirements
  4. Time cost: Each round requires processing time

If large numbers of low-scoring Stories remain after 3 rounds:
1. Output current results with annotations
2. Suggest returning to Story Writer to regenerate
3. Analyze whether RFP itself has systematic issues

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.