# skill-studio

by lzw12w

# Install this skill:
npx skills add lzw12w/agent-skills-studio

Or install a specific skill: npx add-skill https://github.com/lzw12w/agent-skills-studio

# Description

Comprehensive skill evaluation and debugging framework for testing agent skills. Use when users need to (1) evaluate a skill's recall rate (how often it triggers correctly), (2) test a skill's accuracy against expected outputs, (3) analyze skill performance with various prompts, or (4) generate improvement recommendations for existing skills. Requires test cases with standard answers.

# SKILL.md


---
name: skill-studio
description: "Comprehensive skill evaluation and debugging framework for testing agent skills. Use when users need to (1) evaluate a skill's recall rate (how often it triggers correctly), (2) test a skill's accuracy against expected outputs, (3) analyze skill performance with various prompts, or (4) generate improvement recommendations for existing skills. Requires test cases with standard answers."
---


Skill Studio

A systematic framework for evaluating and debugging agent skills through automated testing of recall rates and output accuracy.

Overview

Skill Studio provides tools and workflows to:
- Test if a skill triggers correctly (recall rate)
- Verify skill outputs match expected results (accuracy rate)
- Evaluate code outputs with specialized metrics (syntax, structure, functionality, quality)
- Generate actionable improvement recommendations
- Track performance across iterations

Core Workflow

Step 1: Collect Test Requirements

Gather from the user:
- Target skill name: The skill being evaluated (e.g., "pptx", "docx", "custom-skill")
- Test cases file: Path to test prompts (or help create one)
- Standard answers file: Path to expected outputs (or help create one)
- Test parameters: Number of prompt variations, testing approach

Step 2: Prepare Test Cases

If test cases don't exist, help create test_cases.json using the format in references/test_format.md.

Key elements:
- Diverse prompts that should trigger the skill
- Negative cases (prompts that should NOT trigger)
- Category labels for analysis
- Expected behaviors
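
For illustration, a minimal test_cases.json could be produced as in the sketch below. The field names (id, category, prompt, should_trigger, expected_behavior) are assumptions for this example; the authoritative schema is the one in references/test_format.md.

# build_test_cases.py -- illustrative only; field names are assumptions,
# the real schema lives in references/test_format.md
import json

test_cases = [
    {
        "id": "tc_001",
        "category": "create",                      # label used for per-category analysis
        "prompt": "Create a three-slide deck summarizing Q3 results",
        "should_trigger": True,                    # positive case: skill is expected to fire
        "expected_behavior": "Produces a .pptx file with three slides",
    },
    {
        "id": "tc_002",
        "category": "negative",
        "prompt": "What is the capital of France?",
        "should_trigger": False,                   # negative case: skill must NOT fire
        "expected_behavior": "Skill is not invoked",
    },
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)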

Step 3: Run Recall Testing

Execute recall rate evaluation:

python3 scripts/test_recall.py \
  --skill-name <skill-name> \
  --test-cases <path-to-test-cases.json> \
  --variations 5 \
  --output outputs/recall_results.json

What it does:
- Generates prompt variations for each test case
- Tests if the skill triggers with each variation
- Calculates recall rates overall and by category
- Identifies false positives and false negatives

Key metrics:
- Overall recall rate: % of prompts that correctly triggered
- Per-category recall: Performance by test case type
- False positive rate: Incorrect triggering
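
As a rough aid for inspecting a finished run, the sketch below summarizes per-category recall from recall_results.json. The result schema shown here ("results" entries with "should_trigger", "triggered", and "category" keys) is an assumption, not a documented contract of test_recall.py.

# summarize_recall.py -- a minimal sketch, assuming the schema described above
import json
from collections import defaultdict

with open("outputs/recall_results.json") as f:
    results = json.load(f).get("results", [])

per_category = defaultdict(lambda: {"hit": 0, "total": 0})
for r in results:
    if r["should_trigger"]:                      # only positive cases count toward recall
        stats = per_category[r.get("category", "uncategorized")]
        stats["total"] += 1
        stats["hit"] += int(r["triggered"])

for category, stats in sorted(per_category.items()):
    rate = stats["hit"] / stats["total"] if stats["total"] else 0.0
    print(f"{category}: {rate:.0%} ({stats['hit']}/{stats['total']})")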

Step 4: Run Accuracy Testing

Execute output accuracy evaluation:

python3 scripts/test_accuracy.py \
  --skill-name <skill-name> \
  --test-cases <path-to-test-cases.json> \
  --standard-answers <path-to-standard-answers.json> \
  --workspace <path-to-workspace-directory> \
  --artifacts-dir test_artifacts \
  --output outputs/accuracy_results.json

What it does:
- Runs the skill with test prompts
- Captures file changes (for coding agent skills)
- Compares outputs to standard answers
- Evaluates both exact matches and semantic similarity
- Saves before/after states and diffs
- Categorizes errors by type

Key parameters:
- --workspace: Optional. Path to workspace directory for capturing code changes
- --artifacts-dir: Optional. Directory to save test artifacts (default: test_artifacts)

Key metrics:
- Exact match accuracy: % of perfect matches
- Semantic similarity: Average similarity score (0-1)
- Error categories: Common failure patterns
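
To illustrate the difference between exact match and similarity scoring, here is a rough, self-contained sketch that scores a text output against a standard answer. It uses a simple character-level ratio from difflib; test_accuracy.py's actual similarity method is not specified here and may differ.

# similarity_sketch.py -- a rough stand-in for output scoring, assuming plain-text outputs
from difflib import SequenceMatcher

def score(actual: str, expected: str) -> dict:
    exact = actual.strip() == expected.strip()                 # exact match check
    similarity = SequenceMatcher(None, actual, expected).ratio()  # 0-1 similarity score
    return {"exact_match": exact, "similarity": round(similarity, 3)}

print(score("def add(a, b):\n    return a + b", "def add(x, y):\n    return x + y"))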

Artifacts saved (when --workspace is provided):

For each test case, the following artifacts are saved in test_artifacts/<test_id>/:
- metadata.json: Test case information
- before_state.json: File states before skill execution
- after_state.json: File states after skill execution
- changes.json: Summary of added/modified/deleted files
- diffs/<file>.diff: Unified diffs for each modified file
- full_output.json: Complete skill response

This allows you to:
- Review exact code changes made by the skill
- Compare actual changes to expected changes
- Debug why tests failed
- Track how the skill evolves over iterations
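
A minimal sketch for reviewing one test case's saved artifacts is shown below. The directory layout follows the list above; the keys assumed inside changes.json ("added", "modified", "deleted") and the test id are illustrative assumptions.

# review_artifacts.py -- browse one test case's artifacts, assuming the layout above
import json
from pathlib import Path

test_dir = Path("test_artifacts") / "tc_001"     # tc_001 is a hypothetical test id

changes = json.loads((test_dir / "changes.json").read_text())
for kind in ("added", "modified", "deleted"):
    for path in changes.get(kind, []):
        print(f"{kind}: {path}")

# Print the unified diff for each modified file, if one was saved.
for diff_file in sorted((test_dir / "diffs").glob("*.diff")):
    print(f"\n--- {diff_file.name} ---")
    print(diff_file.read_text())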

For code outputs:
- Syntax validity: Does code parse correctly?
- Structure match: Has expected functions/classes/imports?
- Functionality: Passes test cases?
- Code quality: Comments, docstrings, error handling, type hints
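
For Python outputs, the first two checks can be approximated with the standard library, as in the sketch below. This is one way to implement them, not necessarily how test_accuracy.py does it.

# code_checks_sketch.py -- approximate "syntax validity" and "structure match" for Python code
import ast

def check_code(source: str, expected_functions: list[str]) -> dict:
    try:
        tree = ast.parse(source)                  # syntax validity: does the code parse?
    except SyntaxError:
        return {"syntax_valid": False, "structure_match": False}
    defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
    return {
        "syntax_valid": True,
        "structure_match": set(expected_functions) <= defined,   # expected functions present?
    }

print(check_code("def parse(path):\n    return path.read_text()", ["parse"]))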

Step 5: Analyze and Generate Recommendations

Run the analysis to get improvement suggestions:

python3 scripts/analyze_results.py \
  --recall-results outputs/recall_results.json \
  --accuracy-results outputs/accuracy_results.json \
  --skill-path <path-to-skill-folder> \
  --output outputs/recommendations.md

Output includes:
- Performance summary with visual indicators
- Specific issues identified (low-performing categories)
- Concrete recommendations for SKILL.md improvements
- Suggested trigger phrases for description
- Priority-ranked action items

Step 6: Implement Improvements

Based on recommendations:
1. Review the generated recommendations document
2. Update the skill's SKILL.md frontmatter description
3. Enhance instructions in SKILL.md body
4. Add clarifying examples or decision trees
5. Re-run tests to validate improvements

Iterate until target performance is achieved.

Interpreting Results

Recall Rate Thresholds

  • >90%: Excellent - skill triggers reliably
  • 70-90%: Good - minor description improvements helpful
  • 50-70%: Fair - description needs refinement
  • <50%: Poor - major triggering issues, likely unclear description

Accuracy Rate Thresholds

  • >85%: Excellent - outputs are high quality
  • 70-85%: Good - minor instruction improvements helpful
  • 55-70%: Fair - instructions need clarification
  • <55%: Poor - major workflow or instruction issues
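
If you want to report these ratings programmatically, the bands above can be encoded in a small helper like the sketch below (boundary values fall into the lower band here).

# thresholds_sketch.py -- encodes the rating bands above as a lookup helper
RECALL_BANDS = [(0.90, "Excellent"), (0.70, "Good"), (0.50, "Fair"), (0.0, "Poor")]
ACCURACY_BANDS = [(0.85, "Excellent"), (0.70, "Good"), (0.55, "Fair"), (0.0, "Poor")]

def rate(value: float, bands) -> str:
    for floor, label in bands:
        if value > floor:
            return label
    return "Poor"

print(rate(0.82, RECALL_BANDS))    # Good (70-90% band)
print(rate(0.82, ACCURACY_BANDS))  # Good (70-85% band)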

Common Issues and Solutions

Low Recall Rate:
- Description is too vague or generic
- Missing key trigger keywords
- Overlaps with other skills' descriptions
- Solution: Add specific file types, actions, or scenarios to description

High False Positive Rate:
- Description is too broad
- Includes common generic terms
- Solution: Be more specific about exact use cases

Low Accuracy Rate:
- Instructions are unclear or incomplete
- Missing critical steps in workflow
- Examples don't match user expectations
- Solution: Add decision trees, more examples, clearer step-by-step guidance

Variable Performance by Category:
- Skill handles some use cases well but not others
- Solution: Add specific guidance for underperforming categories

Best Practices

Test Case Design

  • Include 20-30 diverse test cases minimum
  • Cover all major use cases the skill should handle
  • Add negative cases (should NOT trigger) at ~20% ratio
  • Use realistic, varied phrasing
  • Label test cases by category for granular analysis

Standard Answers

  • Define clear quality criteria
  • Use semantic similarity for flexible matching when appropriate
  • Include both structure and content expectations
  • Specify file formats and key elements

Iterative Testing

  • Test after each modification to track progress
  • Focus on lowest-performing categories first
  • Keep a log of changes and their impact
  • Aim for incremental improvements (5-10% gains per iteration)

Skill Description Optimization

  • Include specific file extensions (e.g., ".docx", ".pptx")
  • List concrete actions (e.g., "creating", "editing", "analyzing")
  • Mention key scenarios (e.g., "when user uploads", "when user requests")
  • Use numbered lists for multiple trigger conditions
  • Avoid generic terms that apply to many skills

Resources

scripts/

  • test_recall.py: Measures skill triggering consistency
  • test_accuracy.py: Evaluates output quality
  • analyze_results.py: Generates improvement recommendations
  • generate_variations.py: Creates prompt variations for testing

references/

  • test_format.md: Detailed test case format specification
  • answer_format.md: Standard answer format specification
  • code_evaluation.md: Code output evaluation format and best practices
  • metrics_explained.md: Deep dive into evaluation metrics

# README.md


Skill-Studio 🚀


Skill-Studio is a comprehensive skill for evaluating and debugging Agent Skills. Through automated testing and in-depth analysis, it helps developers improve a Skill's recall rate and output accuracy, and it provides targeted recommendations for improvement.

🌟 Key Features

  • Recall Evaluation: Tests the triggering stability of Skills under various prompt variations.
  • Accuracy Testing: Compares Skill outputs against standard answers, supporting semantic similarity analysis.
  • Automated Prompt Variations: Automatically generates multiple prompt variations to simulate different user expressions.
  • In-depth Analysis & Recommendations: Automatically identifies potential issues and provides optimization suggestions based on test metrics.
  • Code & State Tracking: Supports capturing file system changes to verify the Skill's impact on the environment.

📂 Project Structure

  • scripts/: Core evaluation scripts.
      • test_recall.py: Tests Skill trigger recall rate.
      • test_accuracy.py: Tests Skill execution accuracy.
      • analyze_results.py: Analyzes results and generates improvement suggestions.
  • assets/: Contains test cases and standard answer examples.
  • references/: Detailed metric definitions, format specifications, and technical documentation.
  • SKILL.md: Detailed information including the Skill description, trigger conditions, input/output formats, etc.

🚀 Quick Start

1. Prepare Test Cases (Optional)

Prepare your test cases in the assets/ directory. Refer to test_format.md for detailed format information.

2. Use the Skill in your Agent conversation to evaluate a Skill; it will perform the following operations:

1. Run Recall Test

Test if the Skill is correctly triggered in expected scenarios:

python3 scripts/test_recall.py --test-cases assets/example_test_cases.json --skill-name my_awesome_skill

2. Run Accuracy Test

Verify if the Skill execution results meet expectations:

python3 scripts/test_accuracy.py --test-cases assets/example_test_cases.json --standard-answers assets/example_standard_answers.json

3. Generate Analysis Report

Analyze test results and get optimization suggestions:

python3 scripts/analyze_results.py --recall-results results/recall_results.json --accuracy-results results/accuracy_results.json

📊 Core Metrics

Recall Rate

Measures how often the Skill triggers when it should.
- Formula: True Positives / (True Positives + False Negatives)
- Goal: > 90%

False Positive Rate

Measures how often the Skill is incorrectly triggered when it shouldn't be.
- Formula: False Positives / (False Positives + True Negatives)
- Goal: < 10%

Accuracy

Measures whether the Skill output or generated changes match the standard answer. Supports exact match and semantic similarity evaluation.
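
For reference, the two rate formulas above can be computed directly from confusion-matrix counts, as in the sketch below; the counts used in the example are made up.

# metrics_sketch.py -- the recall and false-positive-rate formulas above
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0       # True Positives / (TP + FN)

def false_positive_rate(fp: int, tn: int) -> float:
    return fp / (fp + tn) if (fp + tn) else 0.0       # False Positives / (FP + TN)

print(recall(tp=27, fn=3))                 # 0.9
print(false_positive_rate(fp=1, tn=9))     # 0.1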

For more detailed metric explanations, please refer to metrics_explained.md.


We hope Skill-Studio helps you build more powerful and reliable Agent Skills! If you have any questions or suggestions, feel free to open an Issue.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.