# test-runner

A Claude Code skill by philoserf.
# Install this skill

To install this specific skill from the multi-skill repository:

```sh
npx skills add philoserf/claude-code-setup --skill "test-runner"
```

# Description

Runs systematic tests on Claude Code customizations. Executes sample queries, validates responses, generates test reports, and identifies edge cases for agents, commands, skills, and hooks.

# SKILL.md


---
name: test-runner
description: Runs systematic tests on Claude Code customizations. Executes sample queries, validates responses, generates test reports, and identifies edge cases for agents, commands, skills, and hooks.
allowed-tools: [Read, Write, Glob, Grep, Bash, Skill]
model: inherit
---


## Reference Files

This skill uses two reference materials, both described later in this document:

  • examples.md - detailed test case examples for each customization type
  • common-failures.md - a catalog of common failure patterns

## Focus Areas

  • Sample Query Generation - Creating realistic test queries based on descriptions
  • Expected Behavior Validation - Verifying outputs match specifications
  • Regression Testing - Ensuring changes don't break existing functionality
  • Edge Case Identification - Finding unusual scenarios and boundary conditions
  • Integration Testing - Validating customizations work together
  • Performance Assessment - Analyzing context usage and efficiency

## Test Framework

### Test Types

#### Functional Tests

Purpose: Verify core functionality works as specified

Process:

  1. Generate test queries from description/documentation
  2. Execute customization with test input
  3. Compare actual output to expected behavior
  4. Record PASS/FAIL for each test case
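A minimal sketch of that loop in Python. `run_customization` is a hypothetical stand-in for however the target is actually invoked (the Skill tool, a command, a hook script), and the substring comparison is a deliberate simplification:

```python
# Minimal functional-test loop: run each case, compare actual output to the
# expected behavior, and record PASS/FAIL. `run_customization` is a
# hypothetical stand-in for the real invocation mechanism.
from typing import Callable

def run_functional_tests(
    cases: list[dict],
    run_customization: Callable[[str], str],
) -> list[dict]:
    results = []
    for case in cases:
        actual = run_customization(case["input"])
        # Substring match is a simplification of "compare to expected behavior".
        status = "PASS" if case["expected"] in actual else "FAIL"
        results.append({**case, "actual": actual, "status": status})
    return results
```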

#### Integration Tests

Purpose: Ensure customizations work together

Process:

  1. Test hook interactions (PreToolUse, PostToolUse chains)
  2. Verify skills can invoke sub-agents
  3. Check commands delegate correctly
  4. Validate settings.json configuration
  5. Test tool permission boundaries
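For step 4, a sketch of a settings.json sanity check. The layout assumed here (a top-level `hooks` map keyed by event name, each holding a list of entries) is an assumption about the configuration shape, not a verified schema:

```python
# Sketch: confirm settings.json parses and hook entries are well-formed.
# The expected key layout is an assumption, not a verified schema.
import json
from pathlib import Path

def check_settings(path: str = "~/.claude/settings.json") -> list[str]:
    problems = []
    try:
        settings = json.loads(Path(path).expanduser().read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"settings.json unreadable or invalid JSON: {exc}"]
    hooks = settings.get("hooks", {})
    for event in ("PreToolUse", "PostToolUse"):
        entries = hooks.get(event, [])
        if not isinstance(entries, list):
            problems.append(f"{event}: expected a list of hook entries")
    return problems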

#### Usability Tests

Purpose: Assess user experience quality

Process:

  1. Evaluate error messages (are they helpful?)
  2. Check documentation completeness
  3. Test edge cases (what breaks it?)
  4. Assess output clarity
  5. Verify examples work as shown
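Step 5 can be partly automated. A small sketch that pulls fenced code blocks out of a customization's documentation so each example can be verified by hand (or executed in active mode):

```python
# Extract fenced code examples from a markdown document so each one can be
# verified. Returns (language, code) pairs; untagged fences become "text".
import re

def extract_examples(markdown_text: str) -> list[tuple[str, str]]:
    fence = re.compile(r"```(\w*)\n(.*?)```", re.DOTALL)
    return [(lang or "text", code) for lang, code in fence.findall(markdown_text)]
```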

### Test Execution Strategy

#### For Skills

  1. Discovery Test: Generate queries that should trigger the skill
  2. Invocation Test: Actually invoke the skill with sample query
  3. Output Test: Verify skill produces expected results
  4. Tool Test: Confirm only allowed tools are used
  5. Reference Test: Check that references load correctly
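A rough way to approximate the discovery test without invoking anything: score each sample query against vocabulary taken from the skill's description. The keyword-overlap heuristic is an assumption standing in for the model's real routing behavior, not a reproduction of it:

```python
# Crude discovery check: does a sample query share vocabulary with the
# skill description? Approximates, but does not reproduce, real triggering.
def likely_triggers(description: str, query: str, threshold: int = 2) -> bool:
    vocab = set(description.lower().split())
    overlap = vocab & set(query.lower().split())
    return len(overlap) >= threshold

description = "Runs systematic tests on Claude Code customizations."
assert likely_triggers(description, "run tests on my claude code skill")
assert not likely_triggers(description, "what is the capital of France?")
```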

#### For Agents

  1. Frontmatter Test: Validate YAML structure
  2. Invocation Test: Invoke agent with test prompt
  3. Tool Test: Verify agent uses appropriate tools
  4. Output Test: Check output format and quality
  5. Context Test: Measure context usage
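A sketch for the frontmatter test, assuming agent files carry standard `---`-delimited YAML frontmatter; the required key set shown is an assumption about what agent files must declare:

```python
# Parse and validate an agent file's YAML frontmatter. Requires PyYAML.
# The required keys are an assumption, not a confirmed specification.
import yaml

def validate_frontmatter(text: str, required=("name", "description")) -> list[str]:
    parts = text.split("---", 2)
    if len(parts) < 3:
        return ["no ----delimited frontmatter block found"]
    try:
        meta = yaml.safe_load(parts[1]) or {}
    except yaml.YAMLError as exc:
        return [f"frontmatter is not valid YAML: {exc}"]
    return [f"missing required key: {key}" for key in required if key not in meta]
```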

#### For Commands

  1. Delegation Test: Verify command invokes correct agent/skill
  2. Usage Test: Test with valid and invalid arguments
  3. Documentation Test: Verify usage instructions are accurate
  4. Output Test: Check output format and clarity

#### For Hooks

  1. Input Test: Verify JSON stdin handling
  2. Exit Code Test: Confirm 0 (allow) and 2 (block) work correctly
  3. Error Handling Test: Verify graceful degradation
  4. Performance Test: Check execution speed
  5. Integration Test: Test hook chain behavior
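The input and exit-code tests map directly onto a subprocess harness. A sketch, assuming the hook is an executable script that reads a JSON payload on stdin and signals allow/block via exit code 0/2 as described above; the script path and payload fields are illustrative:

```python
# Drive a hook script with a JSON payload on stdin and check its exit code:
# 0 = allow, 2 = block (per the contract above). The payload fields and the
# ./hooks/guard.sh path are illustrative, not a verified schema.
import json
import subprocess

def run_hook(hook_path: str, payload: dict, timeout: float = 5.0) -> int:
    proc = subprocess.run(
        [hook_path],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode

# Expected to pass (exit 0):
assert run_hook("./hooks/guard.sh", {"tool_name": "Read", "tool_input": {}}) == 0
# Expected to block (exit 2):
assert run_hook("./hooks/guard.sh", {"tool_name": "Bash",
                                     "tool_input": {"command": "rm -rf /"}}) == 2
```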

## Test Process

### Step 1: Identify Customization Type

Determine what to test:

  • Agent (in agents/)
  • Command (in commands/)
  • Skill (in skills/)
  • Hook (in hooks/)
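This lookup is mechanical. A sketch, assuming the directory layout above:

```python
# Classify a customization by its parent directory, per the layout above.
from pathlib import Path

def customization_type(path: str) -> str:
    parents = {p.name for p in Path(path).parents}
    for kind in ("agents", "commands", "skills", "hooks"):
        if kind in parents:
            return kind.rstrip("s")  # "skills" -> "skill", etc.
    return "unknown"

assert customization_type("skills/test-runner/SKILL.md") == "skill"
```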

### Step 2: Read Documentation

Use Read tool to examine:

  • Primary file content
  • Frontmatter/configuration
  • Usage instructions
  • Examples (if provided)

### Step 3: Generate Test Cases

Based on description and documentation:

For Skills:

  • Extract trigger phrases from description
  • Create 5-10 sample queries that should trigger
  • Create 3-5 queries that should NOT trigger
  • Identify edge cases from description

For Agents:

  • Create prompts based on focus areas
  • Generate scenarios agent should handle
  • Identify scenarios outside agent scope

For Commands:

  • Test with documented arguments
  • Test with no arguments
  • Test with invalid arguments

For Hooks:

  • Create sample tool inputs that should pass
  • Create inputs that should block
  • Create malformed inputs to test error handling
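However the cases are generated, it helps to keep them in one uniform structure that Step 4 can execute and Step 5 can score. A sketch of that shape, with hypothetical example data for a skill:

```python
# One uniform shape for generated test cases, whatever the target type.
# The example data below is hypothetical.
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    input: str             # query, prompt, arguments, or stdin payload
    expected: str          # expected behavior, from docs/description
    should_trigger: bool = True

cases = [
    TestCase("happy path", "test my new skill", "produces a test report"),
    TestCase("off topic", "book me a flight", "skill not invoked",
             should_trigger=False),
    TestCase("edge: empty query", "", "graceful error message"),
]
```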

### Step 4: Execute Tests

Read-Only Testing (default):

  • Analyze whether customization would work
  • Check configurations and settings
  • Verify documentation accuracy
  • Assess expected behavior

Active Testing (when appropriate):

  • Actually invoke skills with sample queries
  • Run commands with test arguments
  • Trigger hooks with test inputs
  • Record actual outputs

### Step 5: Compare Results

For each test:

  • Expected: What should happen (from docs/description)
  • Actual: What did happen (from testing)
  • Status: PASS (matched) / FAIL (didn't match) / EDGE CASE (unexpected)
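A sketch of the comparison. The substring match and the edge-case heuristic (the run produced something anomalous rather than a clean match or miss) are simplifications:

```python
# Score one test: PASS if expected behavior shows up in actual output,
# EDGE CASE if the run was anomalous (empty output or an exception),
# FAIL otherwise. Both heuristics are simplifications.
def score(expected: str, actual: str | None, error: Exception | None = None) -> str:
    if error is not None or not actual:
        return "EDGE CASE"
    return "PASS" if expected.lower() in actual.lower() else "FAIL"
```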

### Step 6: Generate Test Report

Create structured report following output format.
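A sketch of how the summary block of the report could be filled in from scored results; the field names follow the template below:

```python
# Fill in the report's summary block from scored results. The `or 1`
# guards the percentage math against an empty test run.
def summary_block(results: list[dict]) -> str:
    total = len(results) or 1
    passed = sum(r["status"] == "PASS" for r in results)
    failed = sum(r["status"] == "FAIL" for r in results)
    edges = sum(r["status"] == "EDGE CASE" for r in results)
    return (
        f"**Total Tests**: {len(results)}\n"
        f"**Passed**: {passed} ({100 * passed // total}%)\n"
        f"**Failed**: {failed} ({100 * failed // total}%)\n"
        f"**Edge Cases**: {edges}\n"
    )
```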

## Output Format

```markdown
# Test Report: {name}

**Type**: {agent|command|skill|hook}
**File**: {path}
**Tested**: {YYYY-MM-DD HH:MM}
**Test Mode**: {read-only|active}

## Summary

{1-2 sentence overview of what was tested and overall results}

## Test Results

**Total Tests**: {count}
**Passed**: {count} ({percentage}%)
**Failed**: {count} ({percentage}%)
**Edge Cases**: {count}

## Functional Tests

### Test 1: {test name}

- **Input**: {test input/query}
- **Expected**: {expected behavior}
- **Actual**: {actual behavior}
- **Status**: PASS | FAIL | EDGE CASE
- **Notes**: {observations}

### Test 2: {test name}

...

## Integration Tests

{If applicable - tests with other customizations}

### Integration 1: {integration name}

- **Components**: {what was tested together}
- **Expected**: {expected interaction}
- **Actual**: {actual interaction}
- **Status**: PASS | FAIL
- **Notes**: {observations}

## Usability Assessment

- **Documentation**: CLEAR | UNCLEAR | MISSING
- **Error Messages**: HELPFUL | CONFUSING | ABSENT
- **Examples**: WORKING | BROKEN | MISSING
- **Overall UX**: EXCELLENT | GOOD | NEEDS WORK | POOR

## Edge Cases Discovered

{Unusual scenarios or boundary conditions found during testing}

1. {edge case 1}
   - **Impact**: {severity}
   - **Recommendation**: {how to handle}

2. {edge case 2}
   ...

## Performance Metrics

{If applicable}

- **Context Usage**: ~{token estimate}
- **Execution Time**: {estimate or N/A}
- **Resource Impact**: LOW | MEDIUM | HIGH

## Failures and Issues

{Detailed analysis of failed tests}

### Failure 1: {test name}

- **Why It Failed**: {root cause}
- **Expected vs Actual**: {specific diff}
- **Fix Recommendation**: {how to resolve}

### Failure 2: {test name}

...

## Recommendations

### Priority 1 (Critical)

{Must-fix issues that prevent functionality}

1. {critical fix 1}
2. {critical fix 2}

### Priority 2 (Important)

{Should-fix issues that impact usability}

1. {important fix 1}
2. {important fix 2}

### Priority 3 (Nice-to-Have)

{Could-fix issues that polish the experience}

1. {enhancement 1}
2. {enhancement 2}

## Next Steps

{Specific actions to address failures and improve tests}

```

## Test Case Examples

For detailed test case examples for skills, agents, commands, and hooks, see examples.md.

## Best Practices for Testing

  1. Test Both Paths: Success cases AND failure cases
  2. Edge Cases Matter: Test boundaries and unusual inputs
  3. Clear Expected Behavior: Document what should happen
  4. Realistic Queries: Use natural language users would actually type
  5. Integration Testing: Don't just test in isolation
  6. Performance Aware: Note if tests are slow or heavy
  7. Regression Testing: Re-test after changes
  8. Document Failures: Explain why tests failed, not just that they did
  9. Actionable Recommendations: Provide specific fixes
  10. Version Testing: Note which version was tested

## Common Test Failures

For a catalog of common failure patterns by customization type, see common-failures.md.

## Tools Used

This skill uses these tools for testing:

  • Read - Examine customization files
  • Write - Generate and save test reports to files
  • Grep - Search for patterns and configurations
  • Glob - Find files and customizations
  • Bash - Execute read-only commands for analysis
  • Skill - Invoke skills for active testing

In read-only mode (the default), no customizations are actually invoked; the skill analyzes configurations and documentation to assess expected behavior. In active mode, skills are invoked with test queries to verify actual behavior.

Test reports are written to ~/.claude/logs/evaluations/tests/ using the Write tool.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.