npx skills add JasonLo/skill-sommelier --skill "ss-skill-tune"
Install specific skill from multi-skill repository
# SKILL.md
name: ss-skill-tune
description: >-
  Self-improving skill optimization using the Karpathy autoresearch pattern.
  Runs a skill repeatedly, evaluates outputs against binary criteria, mutates
  the SKILL.md to keep winners, and loops until convergence. Use when the user
  wants to tune a skill, optimize a skill, improve skill quality with evals,
  auto-tune a prompt, run autoresearch, benchmark a skill, or self-improve a
  skill. Triggers on "tune skill", "skill tune", "autoresearch", "optimize
  skill", "auto-tune", "eval loop", "self-improving", "benchmark skill",
  "run evals on skill".
allowed-tools:
- Bash
- Read
- Write
- Edit
- Glob
- Grep
- Agent
Autoresearch — Self-Improving Skill Optimization
Applies the Karpathy autoresearch pattern to any Claude Code skill: generate outputs, evaluate against binary criteria, keep winners, mutate the prompt, repeat.
When to Use
- Optimizing a skill's instructions for reliability
- A skill produces inconsistent outputs and needs tuning
- Setting up automated eval loops for a skill
- User says "autoresearch", "optimize this skill", "run evals"
When NOT to Use
- Creating a new skill from scratch: use ss-skill-craft
- One-time skill review without iteration: use ss-skill-craft improve mode
- The skill has no measurable output (pure side-effect skills like git helpers)
Phase 1 — Select Target Skill
Entry: User wants to optimize a skill.
- If user specifies a skill, read its SKILL.md
- Otherwise, list all skills in skills/ and ask which to optimize
- Read the target SKILL.md fully: understand what it does, its phases, and expected outputs
Exit: Target skill path confirmed, SKILL.md contents understood.
Phase 2 — Define Eval Criteria
Entry: Target skill selected.
Guide the user to define 3-6 binary (Yes/No) evaluation criteria. These must be:
- Binary — PASS or FAIL, no scales. Scales introduce variability and compound probabilities.
- Observable — Claude can judge from the output alone (no side-effect checking)
- Independent — Each criterion tests one thing
- Not over-constrained — Avoid narrow rules (exact word counts, specific phrases) that the model can game without improving quality
Present criteria as a table for confirmation:
| # | Name | Question (Yes = PASS) |
|---|---|---|
| 1 | legible | Is all text clear, correctly spelled, and grammatical? |
| 2 | structured | Does the output follow a clear logical structure? |
| ... | ... | ... |
Ask the user to confirm or adjust before proceeding.
Exit: User-confirmed list of binary eval criteria.
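The mechanical parts of the rules above can be checked before the criteria go to the user. The following is a minimal sketch, not part of the skill itself: the `Criterion` class, the `validate_criteria` helper, and the third example criterion (`on_topic`) are hypothetical names introduced for illustration. It only checks the count, name uniqueness, and question form; whether a criterion is observable or over-constrained still needs human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # phrased so that "Yes" means PASS


def validate_criteria(criteria: list[Criterion]) -> list[str]:
    """Return a list of problems; an empty list means the set looks usable."""
    problems = []
    if not 3 <= len(criteria) <= 6:
        problems.append(f"expected 3-6 criteria, got {len(criteria)}")
    names = [c.name for c in criteria]
    if len(set(names)) != len(names):
        problems.append("criterion names must be unique")
    for c in criteria:
        # A yes/no question should at least be phrased as a question.
        if not c.question.strip().endswith("?"):
            problems.append(f"{c.name}: question should end with '?'")
    return problems


criteria = [
    Criterion("legible", "Is all text clear, correctly spelled, and grammatical?"),
    Criterion("structured", "Does the output follow a clear logical structure?"),
    Criterion("on_topic", "Does the output address the prompt that was given?"),  # hypothetical
]
print(validate_criteria(criteria))
```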
Phase 3 — Define Test Prompts
Entry: Eval criteria confirmed.
- Draft 5-10 diverse test prompts that trigger the target skill
- Cover different scenarios the skill should handle
- Include edge cases if relevant
- Present to user for confirmation
Good test prompts:
- Exercise different code paths in the skill
- Vary in complexity (simple → complex)
- Represent real usage patterns
Exit: User-confirmed list of test prompts.
Phase 4 — Create Config and Setup
Entry: Criteria and test prompts confirmed.
- Create the config and data directories under the skill-tune skill:
skills/ss-skill-tune/runs/<target-skill>/
  config.json        # Criteria, test prompts, settings
  data/
    state.json       # Run number, best score
    results.jsonl    # Append-only experiment log
    outputs/         # Raw outputs per run
      run_001/
      run_002/
- Write config.json:
{
  "skill_path": "skills/<name>/SKILL.md",
  "eval_criteria": [
    {
      "name": "criterion_name",
      "question": "Is the output X? (Yes = PASS)"
    }
  ],
  "test_prompts": [
    "prompt 1",
    "prompt 2"
  ],
  "batch_size": 5,
  "cycle_seconds": 120,
  "max_cycles": 10,
  "eval_model": "sonnet",
  "mutate_model": "sonnet"
}
- Initialize data/state.json: {"best_score": -1, "run_number": 0}
Exit: Config file written, directory structure created, user confirms settings.
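The setup steps in this phase can be scripted. The sketch below is one possible way to do it, assuming that results.jsonl and outputs/ sit under data/ alongside state.json; the `init_run` helper and its parameters are hypothetical and are not part of the skill's own scripts. The config values are the defaults shown above.

```python
import json
import tempfile
from pathlib import Path


def init_run(root: Path, target_skill: str, criteria: list[dict], prompts: list[str]) -> Path:
    """Create runs/<target_skill>/ with config.json and an initial data/ directory."""
    run_dir = root / "runs" / target_skill
    data_dir = run_dir / "data"
    (data_dir / "outputs").mkdir(parents=True, exist_ok=True)

    config = {
        "skill_path": f"skills/{target_skill}/SKILL.md",
        "eval_criteria": criteria,
        "test_prompts": prompts,
        "batch_size": 5,
        "cycle_seconds": 120,
        "max_cycles": 10,
        "eval_model": "sonnet",
        "mutate_model": "sonnet",
    }
    (run_dir / "config.json").write_text(json.dumps(config, indent=2))

    state_path = data_dir / "state.json"
    if not state_path.exists():  # do not clobber the state of an earlier run
        state_path.write_text(json.dumps({"best_score": -1, "run_number": 0}))
    (data_dir / "results.jsonl").touch()  # append-only log starts empty
    return run_dir


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    run = init_run(
        Path(tmp),
        "my-skill",  # hypothetical target skill name
        [{"name": "legible", "question": "Is all text clear? (Yes = PASS)"}],
        ["prompt 1", "prompt 2"],
    )
    print(sorted(p.relative_to(run).as_posix() for p in run.rglob("*")))
```

Leaving an existing state.json untouched means a re-run of the setup does not reset the best score of a run already in progress.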
Phase 5 — Run Optimization Loop
Entry: Config and directory structure ready.
Run the autoresearch script:
python3 skills/ss-skill-tune/scripts/autoresearch.py skills/ss-skill-tune/runs/<target-skill>/config.json
Options:
- --once — single cycle (good for testing setup)
- --cycles N — run N cycles
- No flag — run until max_cycles from config
Each cycle:
1. Generate — Run Claude with the target SKILL.md as context + each test prompt via claude -p CLI (uses your current session auth, no API key needed). Save raw outputs.
2. Evaluate — For each output, ask Claude to judge against every binary criterion. Score = total PASSes across all outputs × all criteria.
3. Compare — If score > best_score, keep this SKILL.md version as the new best. Otherwise discard.
4. Mutate — Feed Claude the current best SKILL.md, the scores per criterion, and common failures. Claude rewrites the skill instructions to fix weaknesses. Save as the new candidate.
5. Log — Append run results to results.jsonl.
6. Wait — Sleep until next cycle.
Exit: Loop completes (max_cycles reached, perfect score, or user stops).
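The control flow of one optimization run can be sketched as follows. This is an illustration of the generate, evaluate, compare, mutate cycle described above, not the contents of autoresearch.py: `run_cycles` and its callback parameters are hypothetical, the Claude calls are replaced by toy stand-ins so the example runs offline, and the wait step is omitted. It assumes the per-criterion scores passed to the mutation step are those of the current best version.

```python
from typing import Callable


def run_cycles(
    skill_text: str,
    prompts: list[str],
    criteria: list[str],
    generate: Callable[[str, str], str],           # (skill_text, prompt) -> output
    judge: Callable[[str, str], bool],             # (output, criterion) -> PASS?
    mutate: Callable[[str, dict[str, int]], str],  # (best skill, passes per criterion) -> candidate
    max_cycles: int,
) -> tuple[str, int, list[dict]]:
    best_skill, best_score, best_breakdown = skill_text, -1, {}
    candidate = skill_text
    max_score = len(prompts) * len(criteria)  # every output passes every criterion
    log = []
    for cycle in range(1, max_cycles + 1):
        outputs = [generate(candidate, p) for p in prompts]
        breakdown = {c: sum(judge(o, c) for o in outputs) for c in criteria}
        score = sum(breakdown.values())  # total PASSes
        kept = score > best_score
        if kept:
            best_skill, best_score, best_breakdown = candidate, score, breakdown
        log.append({"run": cycle, "score": score, "kept": kept})
        if best_score == max_score:
            break  # perfect score, nothing left to fix
        # Always mutate from the best version, never from a losing candidate.
        candidate = mutate(best_skill, best_breakdown)
    return best_skill, best_score, log


# Toy stand-ins: a criterion "passes" when its name appears in the output.
prompts = ["p1", "p2", "p3"]
criteria = ["alpha", "beta"]


def toy_mutate(skill: str, breakdown: dict[str, int]) -> str:
    weakest = min(breakdown, key=breakdown.get)  # criterion with the fewest passes
    return f"{skill} {weakest}"


best, score, log = run_cycles(
    "alpha", prompts, criteria,
    generate=lambda s, p: f"{s} | {p}",
    judge=lambda o, c: c in o,
    mutate=toy_mutate,
    max_cycles=10,
)
print(best, score, len(log))  # prints: alpha beta 6 2
```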
Phase 6 — Review Results
Entry: Optimization loop finished.
- Read results.jsonl and summarize:
  - Starting score vs final best score
  - Per-criterion improvement breakdown
  - Number of runs kept vs discarded
  - Score progression over cycles
- Show the diff between original SKILL.md and optimized version
- Ask user whether to:
- Accept — replace original SKILL.md with optimized version
- Inspect — show the full optimized SKILL.md for manual review
- Revert — discard and keep original
- Continue — run more cycles
Exit: User decides on final SKILL.md version.
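A summary like the one described in this phase can be computed directly from the log. The record schema of results.jsonl is not specified here, so the sketch below assumes each line is a JSON object with a numeric `score` and a boolean `kept` field; both field names and the `summarize` helper are assumptions and should be adjusted to whatever the script actually writes.

```python
import json


def summarize(lines: list[str]) -> dict:
    """Summarize JSONL run records (assumed fields: 'score', 'kept')."""
    runs = [json.loads(line) for line in lines if line.strip()]
    if not runs:
        return {"runs": 0}
    scores = [r["score"] for r in runs]
    kept = sum(1 for r in runs if r["kept"])
    return {
        "runs": len(runs),
        "start_score": scores[0],
        "best_score": max(scores),
        "kept": kept,
        "discarded": len(runs) - kept,
        "progression": scores,
    }


# Made-up sample records, only to show the shape of the summary.
sample = [
    '{"score": 12, "kept": true}',
    '{"score": 11, "kept": false}',
    '{"score": 15, "kept": true}',
]
print(summarize(sample))
```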
Phase 7 — Optional Dashboard
If the user wants live monitoring during the loop:
python3 skills/ss-skill-tune/scripts/dashboard.py skills/ss-skill-tune/runs/<target-skill>/data --port 8501
Serves a live dashboard at http://localhost:8501 with:
- Score-over-time chart with keep/discard coloring
- Per-criterion breakdown charts
- Run history table
- Current best prompt display
- Auto-refreshes every 15s
Key Principles
- Binary evals only — No 1-10 scales. Binary criteria produce stable, comparable scores across runs.
- Always mutate from the best — Never mutate from a losing candidate. Always use the highest-scoring SKILL.md as the mutation base.
- Keep it broad — Over-constrained criteria lead to gaming (the model satisfies the letter of the rule without improving quality).
- Diverse test prompts — A skill optimized on one prompt will overfit. Use 5-10 diverse prompts per batch.
- Know when to stop — Diminishing returns are real. If scores plateau for 3+ cycles, stop and review.
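The plateau rule in the last principle can be turned into a simple check. This is a minimal sketch under one interpretation of "plateau": the last three cycles produced no score above the best seen before them. The `has_plateaued` helper and its `patience` parameter are hypothetical names.

```python
def has_plateaued(scores: list[int], patience: int = 3) -> bool:
    """True when the last `patience` cycles failed to beat the earlier best."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


print(has_plateaued([3, 5, 5, 4, 5]))  # prints: True
print(has_plateaued([3, 5, 6, 4, 5]))  # prints: False
```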