```bash
npx skills add lzw12w/agent-skills-studio
```
Or install a specific skill:
```bash
npx add-skill https://github.com/lzw12w/agent-skills-studio
```
# Description
Comprehensive skill evaluation and debugging framework for testing agent skills. Use when users need to (1) evaluate a skill's recall rate (how often it triggers correctly), (2) test a skill's accuracy against expected outputs, (3) analyze skill performance with various prompts, or (4) generate improvement recommendations for existing skills. Requires test cases with standard answers.
# SKILL.md
name: skill-studio
description: "Comprehensive skill evaluation and debugging framework for testing agent skills. Use when users need to (1) evaluate a skill's recall rate (how often it triggers correctly), (2) test a skill's accuracy against expected outputs, (3) analyze skill performance with various prompts, or (4) generate improvement recommendations for existing skills. Requires test cases with standard answers."
## Skill Studio
A systematic framework for evaluating and debugging agent skills through automated testing of recall rates and output accuracy.
### Overview
Skill Studio provides tools and workflows to:
- Test if a skill triggers correctly (recall rate)
- Verify skill outputs match expected results (accuracy rate)
- Evaluate code outputs with specialized metrics (syntax, structure, functionality, quality)
- Generate actionable improvement recommendations
- Track performance across iterations
### Core Workflow
#### Step 1: Collect Test Requirements
Gather from the user:
- Target skill name: The skill being evaluated (e.g., "pptx", "docx", "custom-skill")
- Test cases file: Path to test prompts (or help create one)
- Standard answers file: Path to expected outputs (or help create one)
- Test parameters: Number of prompt variations, testing approach
#### Step 2: Prepare Test Cases
If test cases don't exist, help create test_cases.json using the format in references/test_format.md; a minimal illustrative example follows the list below.
Key elements:
- Diverse prompts that should trigger the skill
- Negative cases (prompts that should NOT trigger)
- Category labels for analysis
- Expected behaviors
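A hedged sketch of such a file, generated from Python. The field names here are assumptions for illustration only; the authoritative schema is defined in references/test_format.md.

```python
import json

# Hypothetical structure -- consult references/test_format.md for the real schema.
test_cases = [
    {
        "id": "tc-001",
        "category": "document-creation",
        "prompt": "Create a quarterly report as a Word document",
        "should_trigger": True,   # positive case: the skill should fire
        "expected_behavior": "Produces a .docx file with the requested sections",
    },
    {
        "id": "tc-002",
        "category": "negative",
        "prompt": "What's the weather like today?",
        "should_trigger": False,  # negative case: the skill should stay silent
        "expected_behavior": "Skill is not invoked",
    },
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)
```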
#### Step 3: Run Recall Testing
Execute recall rate evaluation:
```bash
python3 scripts/test_recall.py \
  --skill-name <skill-name> \
  --test-cases <path-to-test-cases.json> \
  --variations 5 \
  --output outputs/recall_results.json
```
What it does:
- Generates prompt variations for each test case
- Tests if the skill triggers with each variation
- Calculates recall rates overall and by category
- Identifies false positives and false negatives
Key metrics:
- Overall recall rate: % of prompts that correctly triggered
- Per-category recall: Performance by test case type
- False positive rate: Incorrect triggering (see the computation sketch below)
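For intuition, these rates reduce to confusion-matrix arithmetic over per-prompt outcomes. The sketch below assumes each result record carries should_trigger and did_trigger flags; the field names actually emitted by test_recall.py may differ.

```python
def recall_and_fpr(results):
    """Compute recall and false positive rate from per-prompt trigger outcomes.

    Each result is assumed to look like {"should_trigger": bool, "did_trigger": bool}.
    """
    tp = sum(r["should_trigger"] and r["did_trigger"] for r in results)
    fn = sum(r["should_trigger"] and not r["did_trigger"] for r in results)
    fp = sum(not r["should_trigger"] and r["did_trigger"] for r in results)
    tn = sum(not r["should_trigger"] and not r["did_trigger"] for r in results)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, false_positive_rate
```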
#### Step 4: Run Accuracy Testing
Execute output accuracy evaluation:
```bash
python3 scripts/test_accuracy.py \
  --skill-name <skill-name> \
  --test-cases <path-to-test-cases.json> \
  --standard-answers <path-to-standard-answers.json> \
  --workspace <path-to-workspace-directory> \
  --artifacts-dir test_artifacts \
  --output outputs/accuracy_results.json
```
What it does:
- Runs the skill with test prompts
- Captures file changes (for coding agent skills)
- Compares outputs to standard answers
- Evaluates both exact matches and semantic similarity
- Saves before/after states and diffs
- Categorizes errors by type
Key parameters:
- --workspace: Optional. Path to workspace directory for capturing code changes
- --artifacts-dir: Optional. Directory to save test artifacts (default: test_artifacts)
Key metrics:
- Exact match accuracy: % of perfect matches
- Semantic similarity: Average similarity score (0-1)
- Error categories: Common failure patterns (a minimal comparison sketch follows)
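As a rough stand-in for how exact match and the 0-1 similarity score can be derived (the shipped script may use a different, e.g. embedding-based, comparison; difflib here is only a lexical approximation):

```python
from difflib import SequenceMatcher

def compare_output(actual: str, expected: str) -> dict:
    """Compare a skill output to a standard answer.

    Exact match is a strict string comparison; similarity is a 0-1 ratio.
    difflib is a lexical stand-in for the semantic similarity the
    accuracy script reports.
    """
    exact = actual.strip() == expected.strip()
    similarity = SequenceMatcher(None, actual, expected).ratio()
    return {"exact_match": exact, "semantic_similarity": round(similarity, 3)}
```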
Artifacts saved (when --workspace is provided):
For each test case, the following artifacts are saved in test_artifacts/<test_id>/:
- metadata.json: Test case information
- before_state.json: File states before skill execution
- after_state.json: File states after skill execution
- changes.json: Summary of added/modified/deleted files
- diffs/<file>.diff: Unified diffs for each modified file
- full_output.json: Complete skill response
This allows you to:
- Review exact code changes made by the skill
- Compare actual changes to expected changes
- Debug why tests failed
- Track how the skill evolves over iterations (see the review sketch below)
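For example, a small review pass could walk the artifact layout listed above and print each test's change summary and diffs. This helper is an assumed convenience, not part of the shipped scripts:

```python
import json
from pathlib import Path

def review_artifacts(artifacts_dir: str = "test_artifacts") -> None:
    """Print the captured change summary and diffs for each test case."""
    root = Path(artifacts_dir)
    if not root.exists():
        print(f"No artifacts found at {root}")
        return
    for test_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        changes_file = test_dir / "changes.json"
        if changes_file.exists():
            summary = json.loads(changes_file.read_text())
            print(f"== {test_dir.name}: {summary}")
        for diff_file in sorted((test_dir / "diffs").glob("*.diff")):
            print(f"-- {diff_file.name}")
            print(diff_file.read_text())

review_artifacts()
```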
For code outputs:
- Syntax validity: Does code parse correctly?
- Structure match: Has expected functions/classes/imports?
- Functionality: Passes test cases?
- Code quality: Comments, docstrings, error handling, type hints (illustrated in the sketch below)
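To make these checks concrete, syntax validity, structure match, and a docstring check can be approximated with Python's ast module. This is an illustrative sketch of the idea, not necessarily how test_accuracy.py implements it:

```python
import ast

def check_code_output(source: str, expected_functions: set) -> dict:
    """Check syntax validity and whether expected function names are present."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return {"syntax_valid": False, "error": str(exc)}

    defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
    has_docstrings = any(
        ast.get_docstring(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.ClassDef))
    )
    return {
        "syntax_valid": True,
        "structure_match": expected_functions.issubset(defined),
        "missing_functions": sorted(expected_functions - defined),
        "has_docstrings": has_docstrings,
    }
```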
#### Step 5: Analyze and Generate Recommendations
Run the analysis to get improvement suggestions:
```bash
python3 scripts/analyze_results.py \
  --recall-results outputs/recall_results.json \
  --accuracy-results outputs/accuracy_results.json \
  --skill-path <path-to-skill-folder> \
  --output outputs/recommendations.md
```
Output includes:
- Performance summary with visual indicators
- Specific issues identified (low-performing categories)
- Concrete recommendations for SKILL.md improvements
- Suggested trigger phrases for description
- Priority-ranked action items
#### Step 6: Implement Improvements
Based on recommendations:
1. Review the generated recommendations document
2. Update the skill's SKILL.md frontmatter description
3. Enhance instructions in SKILL.md body
4. Add clarifying examples or decision trees
5. Re-run tests to validate improvements
Iterate until target performance is achieved.
### Interpreting Results
#### Recall Rate Thresholds
- >90%: Excellent - skill triggers reliably
- 70-90%: Good - minor description improvements helpful
- 50-70%: Fair - description needs refinement
- <50%: Poor - major triggering issues, likely unclear description
#### Accuracy Rate Thresholds
- >85%: Excellent - outputs are high quality
- 70-85%: Good - minor instruction improvements helpful
- 55-70%: Fair - instructions need clarification
- <55%: Poor - major workflow or instruction issues
### Common Issues and Solutions
Low Recall Rate:
- Description is too vague or generic
- Missing key trigger keywords
- Overlaps with other skills' descriptions
- Solution: Add specific file types, actions, or scenarios to description
High False Positive Rate:
- Description is too broad
- Includes common generic terms
- Solution: Be more specific about exact use cases
Low Accuracy Rate:
- Instructions are unclear or incomplete
- Missing critical steps in workflow
- Examples don't match user expectations
- Solution: Add decision trees, more examples, clearer step-by-step guidance
Variable Performance by Category:
- Skill handles some use cases well but not others
- Solution: Add specific guidance for underperforming categories
### Best Practices
#### Test Case Design
- Include 20-30 diverse test cases minimum
- Cover all major use cases the skill should handle
- Add negative cases (should NOT trigger) at ~20% ratio
- Use realistic, varied phrasing
- Label test cases by category for granular analysis
#### Standard Answers
- Define clear quality criteria
- Use semantic similarity for flexible matching when appropriate
- Include both structure and content expectations
- Specify file formats and key elements (an illustrative example follows)
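As an assumed illustration (the authoritative format is defined in references/answer_format.md, so these field names are hypothetical), a standard answer entry pairing structure and content expectations might be generated like this:

```python
import json

# Hypothetical entry -- consult references/answer_format.md for the real schema.
standard_answers = {
    "tc-001": {
        "match_mode": "semantic",        # or "exact" for strict comparison
        "expected_files": ["report.docx"],
        "expected_structure": {"sections": ["Summary", "Results", "Next Steps"]},
        "expected_content": "A quarterly report covering revenue, costs, and outlook",
        "min_similarity": 0.8,
    }
}

with open("standard_answers.json", "w") as f:
    json.dump(standard_answers, f, indent=2)
```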
#### Iterative Testing
- Test after each modification to track progress
- Focus on lowest-performing categories first
- Keep a log of changes and their impact
- Aim for incremental improvements (5-10% gains per iteration)
#### Skill Description Optimization
- Include specific file extensions (e.g., ".docx", ".pptx")
- List concrete actions (e.g., "creating", "editing", "analyzing")
- Mention key scenarios (e.g., "when user uploads", "when user requests")
- Use numbered lists for multiple trigger conditions
- Avoid generic terms that apply to many skills
### Resources
#### scripts/
- test_recall.py: Measures skill triggering consistency
- test_accuracy.py: Evaluates output quality
- analyze_results.py: Generates improvement recommendations
- generate_variations.py: Creates prompt variations for testing
#### references/
- test_format.md: Detailed test case format specification
- answer_format.md: Standard answer format specification
- code_evaluation.md: Code output evaluation format and best practices
- metrics_explained.md: Deep dive into evaluation metrics
# README.md
## Skill-Studio 🚀
Skill-Studio is a comprehensive skill for evaluating and debugging Agent Skills. Through automated testing and in-depth analysis, it helps developers improve a Skill's recall rate and output accuracy, and provides targeted recommendations for improvement.
### 🌟 Key Features
- Recall Evaluation: Tests the triggering stability of Skills under various prompt variations.
- Accuracy Testing: Compares Skill outputs against standard answers, supporting semantic similarity analysis.
- Automated Prompt Variations: Automatically generates multiple prompt variations to simulate different user expressions.
- In-depth Analysis & Recommendations: Automatically identifies potential issues and provides optimization suggestions based on test metrics.
- Code & State Tracking: Supports capturing file system changes to verify the Skill's impact on the environment.
### 📂 Project Structure
- scripts/: Core evaluation scripts.
  - test_recall.py: Tests Skill trigger recall rate.
  - test_accuracy.py: Tests Skill execution accuracy.
  - analyze_results.py: Analyzes results and generates improvement suggestions.
- assets/: Contains test cases and standard answer examples.
- references/: Detailed metric definitions, format specifications, and technical documentation.
- SKILL.md: Detailed information including Skill descriptions, trigger conditions, input/output formats, etc.
### 🚀 Quick Start
1. Prepare Test Cases (Optional)
Prepare your test cases in the assets/ directory. Refer to test_format.md for detailed format information.
2. Use the Skill in your Agent conversation to evaluate a Skill; it will perform the following steps:
1. Run Recall Test
Test if the Skill is correctly triggered in expected scenarios:
```bash
python3 scripts/test_recall.py --test-cases assets/example_test_cases.json --skill-name my_awesome_skill
```
2. Run Accuracy Test
Verify if the Skill execution results meet expectations:
```bash
python3 scripts/test_accuracy.py --test-cases assets/example_test_cases.json --standard-answers assets/example_standard_answers.json
```
3. Generate Analysis Report
Analyze test results and get optimization suggestions:
```bash
python3 scripts/analyze_results.py --recall-results results/recall_results.json --accuracy-results results/accuracy_results.json
```
### 📊 Core Metrics
#### Recall Rate
Measures how often the Skill triggers when it should.
- Formula: True Positives / (True Positives + False Negatives)
- Goal: > 90%
#### False Positive Rate
Measures how often the Skill is incorrectly triggered when it shouldn't be.
- Formula: False Positives / (False Positives + True Negatives)
- Goal: < 10%
#### Accuracy
Measures whether the Skill output or generated changes match the standard answer. Supports exact match and semantic similarity evaluation.
For more detailed metric explanations, please refer to metrics_explained.md.
We hope Skill-Studio helps you build more powerful and reliable Agent Skills! If you have any questions or suggestions, feel free to open an Issue.
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.