```bash
npx skills add lzw12w/agent-skills-studio
```
Or install a specific skill:
```bash
npx add-skill https://github.com/lzw12w/agent-skills-studio
```
# Description
Comprehensive skill evaluation and debugging framework for testing agent skills. Use when users need to (1) evaluate a skill's recall rate (how often it triggers correctly), (2) test a skill's accuracy against expected outputs, (3) analyze skill performance with various prompts, or (4) generate improvement recommendations for existing skills. Requires test cases with standard answers.
# SKILL.md
name: skill-studio
description: "Comprehensive skill evaluation and debugging framework for testing agent skills. Use when users need to (1) evaluate a skill's recall rate (how often it triggers correctly), (2) test a skill's accuracy against expected outputs, (3) analyze skill performance with various prompts, or (4) generate improvement recommendations for existing skills. Requires test cases with standard answers."
## Skill Studio
A systematic framework for evaluating and debugging agent skills through automated testing of recall rates and output accuracy.
### Overview
Skill Studio provides tools and workflows to:
- Test if a skill triggers correctly (recall rate)
- Verify skill outputs match expected results (accuracy rate)
- Evaluate code outputs with specialized metrics (syntax, structure, functionality, quality)
- Generate actionable improvement recommendations
- Track performance across iterations
### Core Workflow
#### Step 1: Collect Test Requirements
Gather from the user:
- Target skill name: The skill being evaluated (e.g., "pptx", "docx", "custom-skill")
- Test cases file: Path to test prompts (or help create one)
- Standard answers file: Path to expected outputs (or help create one)
- Test parameters: Number of prompt variations, testing approach
#### Step 2: Prepare Test Cases
If test cases don't exist, help create test_cases.json using the format in references/test_format.md; a minimal illustrative example follows the list below.
Key elements:
- Diverse prompts that should trigger the skill
- Negative cases (prompts that should NOT trigger)
- Category labels for analysis
- Expected behaviors
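A hedged sketch of such a file, generated from Python. The field names here are assumptions for illustration only; the authoritative schema is defined in references/test_format.md.

```python
import json

# Hypothetical structure -- consult references/test_format.md for the real schema.
test_cases = [
    {
        "id": "tc-001",
        "category": "document-creation",
        "prompt": "Create a quarterly report as a Word document",
        "should_trigger": True,   # positive case: the skill should fire
        "expected_behavior": "Produces a .docx file with the requested sections",
    },
    {
        "id": "tc-002",
        "category": "negative",
        "prompt": "What's the weather like today?",
        "should_trigger": False,  # negative case: the skill should stay silent
        "expected_behavior": "Skill is not invoked",
    },
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)
```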
#### Step 3: Run Recall Testing
Execute recall rate evaluation:
```bash
python3 scripts/test_recall.py \
  --skill-name <skill-name> \
  --test-cases <path-to-test-cases.json> \
  --variations 5 \
  --output outputs/recall_results.json
```
What it does:
- Generates prompt variations for each test case
- Tests if the skill triggers with each variation
- Calculates recall rates overall and by category
- Identifies false positives and false negatives
Key metrics:
- Overall recall rate: % of prompts that correctly triggered
- Per-category recall: Performance by test case type
- False positive rate: Incorrect triggering (see the computation sketch below)
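For intuition, these rates reduce to confusion-matrix arithmetic over per-prompt outcomes. The sketch below assumes each result record carries should_trigger and did_trigger flags; the field names actually emitted by test_recall.py may differ.

```python
def recall_and_fpr(results):
    """Compute recall and false positive rate from per-prompt trigger outcomes.

    Each result is assumed to look like {"should_trigger": bool, "did_trigger": bool}.
    """
    tp = sum(r["should_trigger"] and r["did_trigger"] for r in results)
    fn = sum(r["should_trigger"] and not r["did_trigger"] for r in results)
    fp = sum(not r["should_trigger"] and r["did_trigger"] for r in results)
    tn = sum(not r["should_trigger"] and not r["did_trigger"] for r in results)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, false_positive_rate
```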
#### Step 4: Run Accuracy Testing
Execute output accuracy evaluation:
```bash
python3 scripts/test_accuracy.py \
  --skill-name <skill-name> \
  --test-cases <path-to-test-cases.json> \
  --standard-answers <path-to-standard-answers.json> \
  --workspace <path-to-workspace-directory> \
  --artifacts-dir test_artifacts \
  --output outputs/accuracy_results.json
```
What it does:
- Runs the skill with test prompts
- Captures file changes (for coding agent skills)
- Compares outputs to standard answers
- Evaluates both exact matches and semantic similarity
- Saves before/after states and diffs
- Categorizes errors by type
Key parameters:
- --workspace: Optional. Path to workspace directory for capturing code changes
- --artifacts-dir: Optional. Directory to save test artifacts (default: test_artifacts)
Key metrics:
- Exact match accuracy: % of perfect matches
- Semantic similarity: Average similarity score (0-1)
- Error categories: Common failure patterns (a minimal comparison sketch follows)
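As a rough stand-in for how exact match and the 0-1 similarity score can be derived (the shipped script may use a different, e.g. embedding-based, comparison; difflib here is only a lexical approximation):

```python
from difflib import SequenceMatcher

def compare_output(actual: str, expected: str) -> dict:
    """Compare a skill output to a standard answer.

    Exact match is a strict string comparison; similarity is a 0-1 ratio.
    difflib is a lexical stand-in for the semantic similarity the
    accuracy script reports.
    """
    exact = actual.strip() == expected.strip()
    similarity = SequenceMatcher(None, actual, expected).ratio()
    return {"exact_match": exact, "semantic_similarity": round(similarity, 3)}
```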
Artifacts saved (when --workspace is provided):
For each test case, the following artifacts are saved in test_artifacts/<test_id>/:
- metadata.json: Test case information
- before_state.json: File states before skill execution
- after_state.json: File states after skill execution
- changes.json: Summary of added/modified/deleted files
- diffs/<file>.diff: Unified diffs for each modified file
- full_output.json: Complete skill response
This allows you to:
- Review exact code changes made by the skill
- Compare actual changes to expected changes
- Debug why tests failed
- Track how the skill evolves over iterations (see the review sketch below)
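For example, a small review pass could walk the artifact layout listed above and print each test's change summary and diffs. This helper is an assumed convenience, not part of the shipped scripts:

```python
import json
from pathlib import Path

def review_artifacts(artifacts_dir: str = "test_artifacts") -> None:
    """Print the captured change summary and diffs for each test case."""
    root = Path(artifacts_dir)
    if not root.exists():
        print(f"No artifacts found at {root}")
        return
    for test_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        changes_file = test_dir / "changes.json"
        if changes_file.exists():
            summary = json.loads(changes_file.read_text())
            print(f"== {test_dir.name}: {summary}")
        for diff_file in sorted((test_dir / "diffs").glob("*.diff")):
            print(f"-- {diff_file.name}")
            print(diff_file.read_text())

review_artifacts()
```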
For code outputs:
- Syntax validity: Does code parse correctly?
- Structure match: Has expected functions/classes/imports?
- Functionality: Passes test cases?
- Code quality: Comments, docstrings, error handling, type hints (illustrated in the sketch below)
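To make these checks concrete, syntax validity, structure match, and a docstring check can be approximated with Python's ast module. This is an illustrative sketch of the idea, not necessarily how test_accuracy.py implements it:

```python
import ast

def check_code_output(source: str, expected_functions: set) -> dict:
    """Check syntax validity and whether expected function names are present."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return {"syntax_valid": False, "error": str(exc)}

    defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
    has_docstrings = any(
        ast.get_docstring(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.ClassDef))
    )
    return {
        "syntax_valid": True,
        "structure_match": expected_functions.issubset(defined),
        "missing_functions": sorted(expected_functions - defined),
        "has_docstrings": has_docstrings,
    }
```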
#### Step 5: Analyze and Generate Recommendations
Run the analysis to get improvement suggestions:
```bash
python3 scripts/analyze_results.py \
  --recall-results outputs/recall_results.json \
  --accuracy-results outputs/accuracy_results.json \
  --skill-path <path-to-skill-folder> \
  --output outputs/recommendations.md
```
Output includes:
- Performance summary with visual indicators
- Specific issues identified (low-performing categories)
- Concrete recommendations for SKILL.md improvements
- Suggested trigger phrases for description
- Priority-ranked action items
#### Step 6: Implement Improvements
Based on recommendations:
1. Review the generated recommendations document
2. Update the skill's SKILL.md frontmatter description
3. Enhance instructions in SKILL.md body
4. Add clarifying examples or decision trees
5. Re-run tests to validate improvements
Iterate until target performance is achieved.
### Interpreting Results
#### Recall Rate Thresholds
- >90%: Excellent - skill triggers reliably
- 70-90%: Good - minor description improvements helpful
- 50-70%: Fair - description needs refinement
- <50%: Poor - major triggering issues, likely unclear description
#### Accuracy Rate Thresholds
- >85%: Excellent - outputs are high quality
- 70-85%: Good - minor instruction improvements helpful
- 55-70%: Fair - instructions need clarification
- <55%: Poor - major workflow or instruction issues
### Common Issues and Solutions
Low Recall Rate:
- Description is too vague or generic
- Missing key trigger keywords
- Overlaps with other skills' descriptions
- Solution: Add specific file types, actions, or scenarios to description
High False Positive Rate:
- Description is too broad
- Includes common generic terms
- Solution: Be more specific about exact use cases
Low Accuracy Rate:
- Instructions are unclear or incomplete
- Missing critical steps in workflow
- Examples don't match user expectations
- Solution: Add decision trees, more examples, clearer step-by-step guidance
Variable Performance by Category:
- Skill handles some use cases well but not others
- Solution: Add specific guidance for underperforming categories
### Best Practices
#### Test Case Design
- Include 20-30 diverse test cases minimum
- Cover all major use cases the skill should handle
- Add negative cases (should NOT trigger) at ~20% ratio
- Use realistic, varied phrasing
- Label test cases by category for granular analysis
#### Standard Answers
- Define clear quality criteria
- Use semantic similarity for flexible matching when appropriate
- Include both structure and content expectations
- Specify file formats and key elements (an illustrative example follows)
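As an assumed illustration (the authoritative format is defined in references/answer_format.md, so these field names are hypothetical), a standard answer entry pairing structure and content expectations might be generated like this:

```python
import json

# Hypothetical entry -- consult references/answer_format.md for the real schema.
standard_answers = {
    "tc-001": {
        "match_mode": "semantic",        # or "exact" for strict comparison
        "expected_files": ["report.docx"],
        "expected_structure": {"sections": ["Summary", "Results", "Next Steps"]},
        "expected_content": "A quarterly report covering revenue, costs, and outlook",
        "min_similarity": 0.8,
    }
}

with open("standard_answers.json", "w") as f:
    json.dump(standard_answers, f, indent=2)
```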
#### Iterative Testing
- Test after each modification to track progress
- Focus on lowest-performing categories first
- Keep a log of changes and their impact
- Aim for incremental improvements (5-10% gains per iteration)
#### Skill Description Optimization
- Include specific file extensions (e.g., ".docx", ".pptx")
- List concrete actions (e.g., "creating", "editing", "analyzing")
- Mention key scenarios (e.g., "when user uploads", "when user requests")
- Use numbered lists for multiple trigger conditions
- Avoid generic terms that apply to many skills
### Resources
#### scripts/
- test_recall.py: Measures skill triggering consistency
- test_accuracy.py: Evaluates output quality
- analyze_results.py: Generates improvement recommendations
- generate_variations.py: Creates prompt variations for testing
#### references/
- test_format.md: Detailed test case format specification
- answer_format.md: Standard answer format specification
- code_evaluation.md: Code output evaluation format and best practices
- metrics_explained.md: Deep dive into evaluation metrics
# README.md
## Skill-Studio 🚀
Skill-Studio is a comprehensive skill for evaluating and debugging Agent Skills. Through automated testing and in-depth analysis, it helps developers improve a Skill's recall rate and output accuracy, and provides targeted recommendations for improvement.
### 🌟 Key Features
- Recall Evaluation: Tests the triggering stability of Skills under various prompt variations.
- Accuracy Testing: Compares Skill outputs against standard answers, supporting semantic similarity analysis.
- Automated Prompt Variations: Automatically generates multiple prompt variations to simulate different user expressions.
- In-depth Analysis & Recommendations: Automatically identifies potential issues and provides optimization suggestions based on test metrics.
- Code & State Tracking: Supports capturing file system changes to verify the Skill's impact on the environment.
### 📂 Project Structure
- scripts/: Core evaluation scripts.
  - test_recall.py: Tests Skill trigger recall rate.
  - test_accuracy.py: Tests Skill execution accuracy.
  - analyze_results.py: Analyzes results and generates improvement suggestions.
- assets/: Contains test cases and standard answer examples.
- references/: Detailed metric definitions, format specifications, and technical documentation.
- SKILL.md: Detailed information including Skill descriptions, trigger conditions, input/output formats, etc.
### 🚀 Quick Start
1. Prepare Test Cases (Optional)
Prepare your test cases in the assets/ directory. Refer to test_format.md for detailed format information.
2. Use the Skill in your Agent conversation to evaluate a Skill; it will perform the following steps:
1. Run Recall Test
Test if the Skill is correctly triggered in expected scenarios:
```bash
python3 scripts/test_recall.py --test-cases assets/example_test_cases.json --skill-name my_awesome_skill
```
2. Run Accuracy Test
Verify if the Skill execution results meet expectations:
```bash
python3 scripts/test_accuracy.py --test-cases assets/example_test_cases.json --standard-answers assets/example_standard_answers.json
```
3. Generate Analysis Report
Analyze test results and get optimization suggestions:
```bash
python3 scripts/analyze_results.py --recall-results results/recall_results.json --accuracy-results results/accuracy_results.json
```
### 📊 Core Metrics
#### Recall Rate
Measures how often the Skill triggers when it should.
- Formula: True Positives / (True Positives + False Negatives)
- Goal: > 90%
#### False Positive Rate
Measures how often the Skill is incorrectly triggered when it shouldn't be.
- Formula: False Positives / (False Positives + True Negatives)
- Goal: < 10%
#### Accuracy
Measures whether the Skill output or generated changes match the standard answer. Supports exact match and semantic similarity evaluation.
For more detailed metric explanations, please refer to metrics_explained.md.
We hope Skill-Studio helps you build more powerful and reliable Agent Skills! If you have any questions or suggestions, feel free to open an Issue.
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.