Install this specific skill from the multi-skill repository:

```
npx skills add notchrisgroves/ia-framework --skill "model-testing"
```
# Description
Automated A/B testing and multi-model comparison for AI models with data-driven recommendations
# SKILL.md

```yaml
name: model-testing
description: Automated A/B testing and multi-model comparison for AI models with data-driven recommendations
agent: engineer
version: 1.0
classification: public
last_updated: 2026-01-26
effort_default: STANDARD
```
## DUAL-PATH ROUTING - READ THIS FIRST

STOP. This skill requires the engineer agent for complex requests.

Identity check: If you are NOT the engineer agent AND your request is complex (benchmark design, statistical analysis, infrastructure setup) → DELEGATE NOW:

```typescript
Task(subagent_type="engineer", prompt="Execute model-testing. Request: {user_request}")
```

DO NOT proceed if you lack engineer expertise for:
- Benchmark suite design
- Statistical analysis and hypothesis testing
- Infrastructure coordination
- Custom metrics implementation

Path 1 - Simple (Tier 1/Haiku): Basic comparisons
- "Compare performance of 2 models"
- "Run A/B test with standard prompts"
- Routes directly, no delegation needed

Path 2 - Complex (Engineer): Advanced testing
- "Design benchmark suite for domain-specific tasks"
- "Multi-round testing with statistical validation"
- Requires engineer delegation
# Model Testing Skill

Automated A/B testing and multi-model comparison for AI models with data-driven recommendations.
## Effort Classification

```
┌─────────────────────────────────────────────────────────┐
│                      TASK RECEIVED                      │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
            ┌───────────────────────────────┐
            │   Is this a single action?    │
            │   (obvious answer, no deps)   │
            └───────────────┬───────────────┘
                            │
              ┌─────────────┴─────────────┐
              │                           │
              ▼                           ▼
      ┌───────────────┐        ┌─────────────────────┐
      │      YES      │        │         NO          │
      │     QUICK     │        └──────────┬──────────┘
      │   (direct)    │                   │
      └───────────────┘                   ▼
                               ┌─────────────────────┐
                               │  Complex/critical?  │
                               │  Research needed?   │
                               │  Multiple files?    │
                               └──────────┬──────────┘
                                          │
                               ┌──────────┴──────────┐
                               │                     │
                               ▼                     ▼
                       ┌─────────────┐       ┌─────────────┐
                       │     NO      │       │     YES     │
                       │  STANDARD   │       │  THOROUGH   │
                       │ (TodoWrite) │       │ (PlanMode)  │
                       └─────────────┘       └─────────────┘
```
| Level | When | Native Tool | Capabilities |
|---|---|---|---|
| QUICK | Single action, obvious answer | Direct response | No gates needed |
| STANDARD | Normal complexity | TodoWrite | Full workflow with criteria tracking |
| THOROUGH | Complex, research needed | EnterPlanMode | User approval required before execution |
Default: STANDARD (TodoWrite for tracking test runs)
## Pre-flight Checklist (MANDATORY)

STOP! Before executing this skill:
- [ ] Read this SKILL.md completely
- [ ] Understand the 5-phase workflow
- [ ] `OPENROUTER_API_KEY` is set in `.env`
- [ ] Test configs ready (or will be created)
## USE WHEN
Invoke this skill when:
- User wants to compare AI models (DeepSeek vs Sonnet, etc.)
- User wants A/B testing for model selection
- User wants multi-model comparison matrix
- User invokes /model-test
- User wants data-driven model recommendations
- Testing which model works best for specific task
DO NOT use when:
- User just wants to run a single model (use Task tool directly)
- User wants to test prompt variations (different skill domain)
- Testing non-AI/ML models
## Quick Start

```
/model-test
/model-test --mode=analyze
/model-test --mode=config
```

Output: `skills/model-testing/output/test-results/` with metrics and recommendations
## 5-Phase Workflow

```
┌──────────┐   ┌─────────┐   ┌──────────┐   ┌─────────┐   ┌─────────┐
│CONFIGURE │──▶│  TEST   │──▶│ EVALUATE │──▶│ ANALYZE │──▶│ REPORT  │
│ (Setup)  │   │(Execute)│   │(Measure) │   │(Decide) │   │ (Show)  │
└─────┬────┘   └────┬────┘   └────┬─────┘   └────┬────┘   └────┬────┘
      │             │             │              │             │
      ▼             ▼             ▼              ▼             ▼
 Models Set     Tests Run     Metrics     Recommendation   Results
  Up Ready     In Parallel   Collected      Generated     Presented
```
| Phase | Domain Name | Gate Question | Output |
|---|---|---|---|
| 1 | CONFIGURE | "Are test settings configured?" | Test config file |
| 2 | TEST | "Did all models complete successfully?" | Model outputs |
| 3 | EVALUATE | "Are all metrics measured?" | QA scores, costs, latency |
| 4 | ANALYZE | "Is recommendation ready?" | Model recommendation |
| 5 | REPORT | "Can user see results?" | Formatted report |
### Phase 1: CONFIGURE (Setup)
Purpose: Set up models to test and test parameters
Gate Question: "Are test settings configured?"
See: skills/model-testing/phases/01-configure.md
Pass Criteria:
- [ ] Models selected (2+ for comparison)
- [ ] Test prompt defined
- [ ] Success criteria defined (QA threshold, cost limits)
- [ ] Test config saved to input/test-configs/
### Phase 2: TEST (Execute)
Purpose: Run prompts through multiple models in parallel
Gate Question: "Did all models complete successfully?"
See: skills/model-testing/phases/02-test.md
Pass Criteria:
- [ ] All models called via OpenRouter API
- [ ] Outputs captured for each model
- [ ] Errors handled gracefully (retry or mark failed)
- [ ] Test metadata logged (timestamp, model versions, etc.)
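
A minimal sketch of the parallel test run, assuming OpenRouter's OpenAI-compatible `chat/completions` endpoint; the result shape and model slugs are illustrative and may need adjusting to OpenRouter's current catalog:

```typescript
// Sketch: run one prompt through several models in parallel via OpenRouter.
// Assumes Node 18+ (global fetch) and OPENROUTER_API_KEY in the environment.
interface ModelRun {
  model: string;
  output: string;
  latencyMs: number;
  failed: boolean;
}

async function callModel(model: string, prompt: string): Promise<ModelRun> {
  const started = Date.now();
  try {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data: any = await res.json();
    return { model, output: data.choices[0].message.content, latencyMs: Date.now() - started, failed: false };
  } catch {
    // A failed model is marked, not fatal: the comparison continues without it.
    return { model, output: "", latencyMs: Date.now() - started, failed: true };
  }
}

// All models run concurrently; Promise.all is safe because callModel never rejects.
const runs = await Promise.all(
  ["deepseek/deepseek-chat", "anthropic/claude-sonnet-4.5"].map(m => callModel(m, "Test prompt")),
);
```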
### Phase 3: EVALUATE (Measure)
Purpose: QA validate outputs and measure metrics
Gate Question: "Are all metrics measured?"
See: skills/model-testing/phases/03-evaluate.md
Pass Criteria:
- [ ] QA validation run on each output (tools/prompt-qa.ts)
- [ ] Cost calculated per model
- [ ] Latency measured per model
- [ ] Success/failure status recorded
- [ ] Metrics saved to output/test-results/metrics.jsonl
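
Since `metrics.jsonl` is JSON Lines, each test appends one self-contained JSON object per line. A sketch using Node's `fs/promises`; the record fields mirror the Metrics Collected section below:

```typescript
import { appendFile } from "node:fs/promises";

// Sketch: append one test's metrics as a single JSON line, so the file
// stays valid JSONL no matter how many tests have run before.
async function recordMetrics(record: object): Promise<void> {
  await appendFile(
    "skills/model-testing/output/test-results/metrics.jsonl",
    JSON.stringify(record) + "\n",
  );
}
```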
### Phase 4: ANALYZE (Decide)
Purpose: Generate data-driven recommendations
Gate Question: "Is recommendation ready?"
See: skills/model-testing/phases/04-analyze.md
Pass Criteria:
- [ ] Sufficient test data collected (min 10 tests per model)
- [ ] Win rates calculated
- [ ] Quality vs cost analyzed
- [ ] Recommendation generated with confidence score
- [ ] Per-task breakdown (if multiple content types tested)
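
A sketch of how the quality-versus-cost trade-off implied by `quality_weight` and `cost_weight` (see Configuration below) might be scored; the normalization and the confidence heuristic are assumptions, not the skill's published formula:

```typescript
// Sketch: rank models by a weighted quality/cost score.
// quality_weight and cost_weight come from the test config (0.7 / 0.3 by default);
// inverting normalized cost so that cheaper scores higher is an assumption.
interface ModelAggregate { model: string; avgQa: number; avgCost: number; winRate: number }

function recommend(stats: ModelAggregate[], qualityWeight = 0.7, costWeight = 0.3) {
  if (stats.length === 0) throw new Error("No aggregated stats to analyze");
  const maxCost = Math.max(...stats.map(s => s.avgCost), 1e-9);
  const scored = stats
    .map(s => ({
      ...s,
      // QA scores are 0-5, so divide by 5; cost is scaled against the priciest model.
      score: qualityWeight * (s.avgQa / 5) + costWeight * (1 - s.avgCost / maxCost),
    }))
    .sort((a, b) => b.score - a.score);
  const [best, runnerUp] = scored;
  // Crude confidence heuristic: how decisively the winner leads the runner-up.
  const confidence = runnerUp ? Math.min(1, 0.5 + (best.score - runnerUp.score)) : 1;
  return { winner: best.model, score: best.score, confidence };
}
```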
### Phase 5: REPORT (Show)
Purpose: Present results to user
Gate Question: "Can user see results?"
See: skills/model-testing/phases/05-report.md
Pass Criteria:
- [ ] Summary statistics formatted
- [ ] Recommendation clearly stated
- [ ] Cost savings calculated
- [ ] Next steps provided
- [ ] Raw data available for review
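
A sketch of the Phase 5 formatting step, turning aggregates into a small markdown summary; the exact report layout is an assumption:

```typescript
// Sketch: render aggregates and the recommendation as a markdown report.
interface ModelSummary { model: string; winRate: number; avgQa: number; avgCost: number }

function formatReport(winner: string, confidence: number, stats: ModelSummary[]): string {
  const rows = stats.map(
    s => `| ${s.model} | ${(s.winRate * 100).toFixed(0)}% | ${s.avgQa.toFixed(1)} | $${s.avgCost.toFixed(4)} |`,
  );
  return [
    `## Recommendation: ${winner} (${(confidence * 100).toFixed(0)}% confidence)`,
    "",
    "| Model | Win rate | Avg QA | Avg cost/test |",
    "|---|---|---|---|",
    ...rows,
  ].join("\n");
}
```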
## Success Criteria Tracking
Overall Success:
- [ ] Test harness runs without errors
- [ ] All models tested successfully
- [ ] Metrics collected for all tests
- [ ] Recommendation generated with ≥70% confidence
- [ ] User can make informed decision
Per-Test Success:
- [ ] Both/all models complete
- [ ] QA validation passes for at least one model
- [ ] Cost and latency recorded
- [ ] Winner selected (or tie noted)
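
A sketch of the per-test winner selection; treating exact QA-score equality between the top two models as a tie is an assumption:

```typescript
// Sketch: pick a per-test winner from QA scores, noting ties.
function pickWinner(qaScores: Record<string, number>): string | null {
  const ranked = Object.entries(qaScores).sort((a, b) => b[1] - a[1]);
  if (ranked.length === 0) return null;
  if (ranked.length === 1 || ranked[0][1] > ranked[1][1]) {
    return ranked[0][0]; // clear winner
  }
  return null; // tie: no winner recorded for this test
}
```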
## Modes

### Test Mode (default)

```
/model-test
```

What it does: Runs A/B or multi-model test, collects metrics, auto-analyzes if threshold reached

Flow: Configure → Test → Evaluate → (Auto-analyze if N tests) → Report
### Analyze Mode

```
/model-test --mode=analyze
```

What it does: Manually trigger analysis on collected test data

Flow: Load metrics → Analyze → Report
### Config Mode

```
/model-test --mode=config
```

What it does: View or modify test configuration

Flow: Load config → Display → Allow edits → Save
### Report Mode

```
/model-test --mode=report
```

What it does: View latest test results without running new tests

Flow: Load latest results → Format → Display
## Error Recovery
| Error | Recovery Action |
|---|---|
| OPENROUTER_API_KEY missing | Prompt user to add to .env |
| Model API call failed | Retry 2x, then mark as failed (continue with other models) |
| QA validation error | Log warning, use raw output |
| Insufficient test data | Inform user, recommend running more tests |
| No clear winner | Report tie, suggest more tests or different criteria |
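
The "retry 2x, then mark as failed" rule might look like the following sketch; the helper name is illustrative:

```typescript
// Sketch: retry a model call twice before giving up, matching the
// "Retry 2x, then mark as failed" recovery rule above.
async function withRetry<T>(fn: () => Promise<T>, retries = 2): Promise<T | null> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === retries) {
        console.warn(`Giving up after ${retries + 1} attempts:`, err);
      }
    }
  }
  return null; // caller marks this model as failed and continues with the others
}
```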
## Input/Output

Input Directory: `skills/model-testing/input/test-configs/`
- `default-config.yaml` - Default test settings
- `<task>-config.yaml` - Per-task configurations

Output Directory: `skills/model-testing/output/test-results/`
- `metrics.jsonl` - All test results (JSON Lines format)
- `<timestamp>-test.json` - Individual test results
- `analysis-<date>.md` - Analysis reports
- `recommendation.md` - Latest recommendation
## Metrics Collected

Per test:
- `timestamp` - When the test ran
- `models` - Which models were tested
- `prompt` - Test prompt used
- `outputs` - Model responses
- `qa_scores` - QA validation scores (0-5)
- `costs` - API costs per model
- `latencies` - Response times (ms)
- `winner` - Which model won this test
- `success` - Whether the test completed successfully

Aggregate:
- `total_tests` - Number of tests run
- `win_rates` - % wins per model
- `avg_qa_scores` - Average quality scores
- `avg_costs` - Average costs
- `avg_latencies` - Average response times
- `success_rates` - % successful completions
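
As TypeScript shapes, the two record types above might look like this sketch; field names mirror the lists, but the exact value types are an assumption:

```typescript
// Sketch: per-test and aggregate metric records for metrics.jsonl.
interface TestRecord {
  timestamp: string;                  // ISO-8601 time the test ran
  models: string[];                   // models compared in this test
  prompt: string;                     // test prompt used
  outputs: Record<string, string>;    // model -> response
  qa_scores: Record<string, number>;  // model -> QA score (0-5)
  costs: Record<string, number>;      // model -> API cost in USD
  latencies: Record<string, number>;  // model -> response time (ms)
  winner: string | null;              // winning model, or null on a tie
  success: boolean;                   // did the test complete
}

interface AggregateMetrics {
  total_tests: number;
  win_rates: Record<string, number>;      // model -> fraction of wins
  avg_qa_scores: Record<string, number>;
  avg_costs: Record<string, number>;
  avg_latencies: Record<string, number>;
  success_rates: Record<string, number>;  // model -> fraction of successful runs
}
```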
## Configuration

Test Config Format (`input/test-configs/default-config.yaml`):

```yaml
models:
  - name: "deepseek-chat"
    provider: "deepseek"
    cost_per_1m: 0.00   # Free tier
  - name: "claude-sonnet-4.5"
    provider: "anthropic"
    cost_per_1m: 3.00

success_criteria:
  min_qa_score: 4       # Minimum QA score to consider "passing"
  max_cost: 0.10        # Maximum acceptable cost per test
  max_latency: 30000    # Maximum latency in ms

analysis:
  auto_analyze_threshold: 50        # Auto-analyze after N tests
  min_tests_for_recommendation: 10  # Minimum tests before recommendation
  confidence_threshold: 0.70        # Minimum confidence for recommendation
  quality_weight: 0.7               # How much to weight quality vs cost (0-1)
  cost_weight: 0.3                  # How much to weight cost vs quality (0-1)
```
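
A sketch of loading and sanity-checking this config before Phase 1 passes its gate, assuming a YAML parser such as `js-yaml`:

```typescript
import { readFile } from "node:fs/promises";
import yaml from "js-yaml"; // assumed dependency; any YAML parser works

// Sketch: load the test config and enforce the Phase 1 pass criteria
// that actually block a comparison (2+ models, sane weights).
async function loadTestConfig(
  path = "skills/model-testing/input/test-configs/default-config.yaml",
) {
  const config = yaml.load(await readFile(path, "utf8")) as any;
  if (!Array.isArray(config?.models) || config.models.length < 2) {
    throw new Error("Need at least 2 models for a comparison");
  }
  const { quality_weight = 0.7, cost_weight = 0.3 } = config.analysis ?? {};
  if (Math.abs(quality_weight + cost_weight - 1) > 1e-6) {
    console.warn("quality_weight and cost_weight normally sum to 1");
  }
  return config;
}
```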
## Related Skills

- /create-skill - Created this skill
- /write - Can use model-testing to optimize content generation
- /training - Can use model-testing to optimize workout plan generation
Version: 1.0
Last Updated: 2026-01-22
Status: Active
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.