# model-testing

by notchrisgroves

# Install this skill:
npx skills add notchrisgroves/ia-framework --skill "model-testing"

Installs the specified skill from a multi-skill repository.

# Description

Automated A/B testing and multi-model comparison for AI models with data-driven recommendations

# SKILL.md


```yaml
name: model-testing
description: Automated A/B testing and multi-model comparison for AI models with data-driven recommendations
agent: engineer
version: 1.0
classification: public
last_updated: 2026-01-26
effort_default: STANDARD
```


⛔ DUAL-PATH ROUTING - READ THIS FIRST

STOP. This skill requires the engineer agent for complex requests.

Identity check: If you are NOT the engineer agent AND your request is complex
(benchmark design, statistical analysis, infrastructure setup) → DELEGATE NOW:

```typescript
Task(subagent_type="engineer", prompt="Execute model-testing. Request: {user_request}")
```

DO NOT proceed if you lack engineer expertise for:
- Benchmark suite design
- Statistical analysis and hypothesis testing
- Infrastructure coordination
- Custom metrics implementation

Path 1 - Simple (Tier 1/Haiku): Basic comparisons
- "Compare performance of 2 models"
- "Run A/B test with standard prompts"
- Routes directly, no delegation needed

Path 2 - Complex (Engineer): Advanced testing
- "Design benchmark suite for domain-specific tasks"
- "Multi-round testing with statistical validation"
- Requires engineer delegation


Model Testing Skill

Automated A/B testing and multi-model comparison for AI models with data-driven recommendations


Effort Classification

```
┌─────────────────────────────────────────────────────────┐
│                    TASK RECEIVED                         │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │   Is this a single action?    │
        │   (obvious answer, no deps)   │
        └───────────────┬───────────────┘
                        │
           ┌────────────┴────────────┐
           │                         │
           ▼                         ▼
    ┌─────────────┐           ┌─────────────────────┐
    │    YES      │           │        NO           │
    │   QUICK     │           └──────────┬──────────┘
    │ (direct)    │                      │
    └─────────────┘                      ▼
                              ┌─────────────────────┐
                              │ Complex/critical?   │
                              │ Research needed?    │
                              │ Multiple files?     │
                              └──────────┬──────────┘
                                         │
                              ┌──────────┴──────────┐
                              │                     │
                              ▼                     ▼
                       ┌─────────────┐       ┌─────────────┐
                       │     NO      │       │    YES      │
                       │  STANDARD   │       │  THOROUGH   │
                       │ (TodoWrite) │       │ (PlanMode)  │
                       └─────────────┘       └─────────────┘
```
| Level | When | Native Tool | Capabilities |
|-------|------|-------------|--------------|
| QUICK | Single action, obvious answer | Direct response | No gates needed |
| STANDARD | Normal complexity | TodoWrite | Full workflow with criteria tracking |
| THOROUGH | Complex, research needed | EnterPlanMode | User approval required before execution |

Default: STANDARD (TodoWrite for tracking test runs)


Pre-flight Checklist (MANDATORY)

STOP! Before executing this skill:

- [ ] Read this SKILL.md completely
- [ ] Understand the 5-phase workflow
- [ ] OPENROUTER_API_KEY is set in .env
- [ ] Test configs ready (or will be created)

USE WHEN

Invoke this skill when:
- User wants to compare AI models (DeepSeek vs Sonnet, etc.)
- User wants A/B testing for model selection
- User wants multi-model comparison matrix
- User invokes /model-test
- User wants data-driven model recommendations
- Testing which model works best for a specific task

DO NOT use when:
- User just wants to run a single model (use Task tool directly)
- User wants to test prompt variations (different skill domain)
- Testing non-AI/ML models


Quick Start

```
/model-test
/model-test --mode=analyze
/model-test --mode=config
```

Output: skills/model-testing/output/test-results/ with metrics and recommendations


5-Phase Workflow

```
┌──────────┐    ┌─────────┐    ┌──────────┐    ┌─────────┐    ┌─────────┐
│CONFIGURE │───▶│  TEST   │───▶│ EVALUATE │───▶│ ANALYZE │───▶│ REPORT  │
│(Setup)   │    │(Execute)│    │(Measure) │    │(Decide) │    │(Show)   │
└─────┬────┘    └────┬────┘    └─────┬────┘    └────┬────┘    └────┬────┘
      │              │               │              │              │
      ▼              ▼               ▼              ▼              ▼
  Models Set    Tests Run      Metrics       Recommendation   Results
  Up Ready      In Parallel    Collected     Generated        Presented
```
| Phase | Domain Name | Gate Question | Output |
|-------|-------------|---------------|--------|
| 1 | CONFIGURE | "Are test settings configured?" | Test config file |
| 2 | TEST | "Did all models complete successfully?" | Model outputs |
| 3 | EVALUATE | "Are all metrics measured?" | QA scores, costs, latency |
| 4 | ANALYZE | "Is recommendation ready?" | Model recommendation |
| 5 | REPORT | "Can user see results?" | Formatted report |

Phase 1: CONFIGURE (Setup)

Purpose: Set up models to test and test parameters

Gate Question: "Are test settings configured?"

See: skills/model-testing/phases/01-configure.md

Pass Criteria:
- [ ] Models selected (2+ for comparison)
- [ ] Test prompt defined
- [ ] Success criteria defined (QA threshold, cost limits)
- [ ] Test config saved to input/test-configs/


Phase 2: TEST (Execute)

Purpose: Run prompts through multiple models in parallel

Gate Question: "Did all models complete successfully?"

See: skills/model-testing/phases/02-test.md

Pass Criteria:
- [ ] All models called via OpenRouter API
- [ ] Outputs captured for each model
- [ ] Errors handled gracefully (retry or mark failed)
- [ ] Test metadata logged (timestamp, model versions, etc.)
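
A minimal sketch of the parallel execution this phase describes, assuming direct calls to OpenRouter's chat completions endpoint. The model IDs, `testPrompt`, and the `ModelRun` shape are illustrative, not the skill's actual harness code.

```typescript
// Sketch only: run the same prompt against several models in parallel via
// OpenRouter, capturing output, latency, and any failure per model.
interface ModelRun {
  model: string;
  output?: string;
  latencyMs: number;
  error?: string;
}

async function runModel(model: string, prompt: string): Promise<ModelRun> {
  const start = Date.now();
  try {
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const data = await res.json();
    return { model, output: data.choices[0].message.content, latencyMs: Date.now() - start };
  } catch (err) {
    // Failures are recorded rather than thrown so the other models still complete.
    return { model, latencyMs: Date.now() - start, error: String(err) };
  }
}

async function runTestRound(models: string[], testPrompt: string): Promise<ModelRun[]> {
  // Promise.all keeps results in model order; each model runs concurrently.
  return Promise.all(models.map((m) => runModel(m, testPrompt)));
}
```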


Phase 3: EVALUATE (Measure)

Purpose: QA validate outputs and measure metrics

Gate Question: "Are all metrics measured?"

See: skills/model-testing/phases/03-evaluate.md

Pass Criteria:
- [ ] QA validation run on each output (tools/prompt-qa.ts)
- [ ] Cost calculated per model
- [ ] Latency measured per model
- [ ] Success/failure status recorded
- [ ] Metrics saved to output/test-results/metrics.jsonl
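
As a sketch of what one line in metrics.jsonl might contain, the shape below mirrors the fields listed under "Metrics Collected"; the `TestMetrics` interface and `recordMetrics` helper are assumptions, not the skill's actual code.

```typescript
import { appendFileSync, mkdirSync } from "node:fs";

// One record per test, appended as a single JSON line (JSON Lines format).
interface TestMetrics {
  timestamp: string;                    // when the test ran (ISO 8601)
  models: string[];                     // which models were tested
  prompt: string;                       // test prompt used
  qa_scores: Record<string, number>;    // QA validation score per model (0-5)
  costs: Record<string, number>;        // API cost per model (USD)
  latencies: Record<string, number>;    // response time per model (ms)
  winner: string | null;                // winning model, or null on a tie
  success: boolean;                     // did the test complete successfully
}

function recordMetrics(
  record: TestMetrics,
  dir = "skills/model-testing/output/test-results",
): void {
  mkdirSync(dir, { recursive: true });
  appendFileSync(`${dir}/metrics.jsonl`, JSON.stringify(record) + "\n");
}
```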


Phase 4: ANALYZE (Decide)

Purpose: Generate data-driven recommendations

Gate Question: "Is recommendation ready?"

See: skills/model-testing/phases/04-analyze.md

Pass Criteria:
- [ ] Sufficient test data collected (min 10 tests per model)
- [ ] Win rates calculated
- [ ] Quality vs cost analyzed
- [ ] Recommendation generated with confidence score
- [ ] Per-task breakdown (if multiple content types tested)
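
The skill does not spell out its scoring or confidence formulas, so the sketch below is one plausible reading of the `quality_weight` / `cost_weight` settings from the config and of a win-rate-based confidence score; treat every function here as an assumption.

```typescript
// Assumed scoring: blend normalized quality against normalized cost.
function combinedScore(
  qaScore: number,          // 0-5 QA validation score
  cost: number,             // observed cost for this test (USD)
  maxCost: number,          // max_cost from success_criteria
  qualityWeight = 0.7,
  costWeight = 0.3,
): number {
  const quality = qaScore / 5;                        // scale QA score to 0-1
  const cheapness = 1 - Math.min(cost / maxCost, 1);  // cheaper runs score higher
  return qualityWeight * quality + costWeight * cheapness;
}

// Assumed confidence: how decisively the leading model wins across all tests.
function winRateConfidence(wins: Record<string, number>, totalTests: number): number {
  if (totalTests === 0) return 0;
  return Math.max(0, ...Object.values(wins)) / totalTests;
}
```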


Phase 5: REPORT (Show)

Purpose: Present results to user

Gate Question: "Can user see results?"

See: skills/model-testing/phases/05-report.md

Pass Criteria:
- [ ] Summary statistics formatted
- [ ] Recommendation clearly stated
- [ ] Cost savings calculated
- [ ] Next steps provided
- [ ] Raw data available for review


Success Criteria Tracking

Overall Success:
- [ ] Test harness runs without errors
- [ ] All models tested successfully
- [ ] Metrics collected for all tests
- [ ] Recommendation generated with ≥70% confidence
- [ ] User can make informed decision

Per-Test Success:
- [ ] Both/all models complete
- [ ] QA validation passes for at least one model
- [ ] Cost and latency recorded
- [ ] Winner selected (or tie noted)


Modes

Test Mode (default)

/model-test

What it does: Runs A/B or multi-model test, collects metrics, auto-analyzes if threshold reached

Flow: Configure → Test → Evaluate → (Auto-analyze after N tests) → Report


Analyze Mode

/model-test --mode=analyze

What it does: Manually trigger analysis on collected test data

Flow: Load metrics → Analyze → Report


Config Mode

/model-test --mode=config

What it does: View or modify test configuration

Flow: Load config → Display → Allow edits → Save


Report Mode

/model-test --mode=report

What it does: View latest test results without running new tests

Flow: Load latest results → Format → Display


Error Recovery

| Error | Recovery Action |
|-------|-----------------|
| OPENROUTER_API_KEY missing | Prompt user to add to .env |
| Model API call failed | Retry 2x, then mark as failed (continue with other models) |
| QA validation error | Log warning, use raw output |
| Insufficient test data | Inform user, recommend running more tests |
| No clear winner | Report tie, suggest more tests or different criteria |
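
A sketch of the "retry 2x, then mark as failed" behavior above; `withRetries` is a hypothetical helper and the failure shape is an assumption, not part of the skill's documented interface.

```typescript
// Retry a failing model call up to `retries` extra times, then return a
// failure marker instead of throwing so the remaining models can still run.
async function withRetries<T>(
  fn: () => Promise<T>,
  retries = 2,
): Promise<T | { failed: true; error: string }> {
  let lastError = "";
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = String(err);
    }
  }
  return { failed: true, error: lastError };
}
```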

Input/Output

Input Directory: skills/model-testing/input/test-configs/
- default-config.yaml - Default test settings
- <task>-config.yaml - Per-task configurations

Output Directory: skills/model-testing/output/test-results/
- metrics.jsonl - All test results (JSON Lines format)
- <timestamp>-test.json - Individual test results
- analysis-<date>.md - Analysis reports
- recommendation.md - Latest recommendation


Metrics Collected

Per test:
- timestamp - When test ran
- models - Which models tested
- prompt - Test prompt used
- outputs - Model responses
- qa_scores - QA validation scores (0-5)
- costs - API costs per model
- latencies - Response times (ms)
- winner - Which model won this test
- success - Did test complete successfully

Aggregate:
- total_tests - Number of tests run
- win_rates - % wins per model
- avg_qa_scores - Average quality scores
- avg_costs - Average costs
- avg_latencies - Average response times
- success_rates - % successful completions
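
As a sketch of how the aggregates might be folded out of metrics.jsonl: the field names follow the per-test list above and the path matches the output directory, but the code itself is illustrative.

```typescript
import { readFileSync } from "node:fs";

// Read every JSON line and fold it into the aggregate metrics listed above.
const records = readFileSync(
  "skills/model-testing/output/test-results/metrics.jsonl",
  "utf8",
)
  .trim()
  .split("\n")
  .filter(Boolean)
  .map((line) => JSON.parse(line));

const totalTests = records.length;

const wins: Record<string, number> = {};
for (const r of records) {
  if (r.winner) wins[r.winner] = (wins[r.winner] ?? 0) + 1;
}
const winRates = Object.fromEntries(
  Object.entries(wins).map(([model, w]) => [model, w / totalTests]),
);

const successRate =
  totalTests > 0 ? records.filter((r) => r.success).length / totalTests : 0;
```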


Configuration

Test Config Format (input/test-configs/default-config.yaml):

```yaml
models:
  - name: "deepseek-chat"
    provider: "deepseek"
    cost_per_1m: 0.00  # Free tier
  - name: "claude-sonnet-4.5"
    provider: "anthropic"
    cost_per_1m: 3.00

success_criteria:
  min_qa_score: 4  # Minimum QA score to consider "passing"
  max_cost: 0.10   # Maximum acceptable cost per test
  max_latency: 30000  # Maximum latency in ms

analysis:
  auto_analyze_threshold: 50  # Auto-analyze after N tests
  min_tests_for_recommendation: 10  # Minimum tests before recommendation
  confidence_threshold: 0.70  # Minimum confidence for recommendation

quality_weight: 0.7  # How much to weight quality vs cost (0-1)
cost_weight: 0.3     # How much to weight cost vs quality (0-1)
```
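
If the test harness is TypeScript, loading this config might look like the sketch below; the `TestConfig` interface and the use of js-yaml are assumptions, and any YAML parser would work just as well.

```typescript
import { readFileSync } from "node:fs";
import { load } from "js-yaml"; // assumed dependency; any YAML parser works

// Mirrors the default-config.yaml structure shown above.
interface TestConfig {
  models: { name: string; provider: string; cost_per_1m: number }[];
  success_criteria: { min_qa_score: number; max_cost: number; max_latency: number };
  analysis: {
    auto_analyze_threshold: number;
    min_tests_for_recommendation: number;
    confidence_threshold: number;
  };
  quality_weight: number;
  cost_weight: number;
}

const config = load(
  readFileSync("skills/model-testing/input/test-configs/default-config.yaml", "utf8"),
) as TestConfig;
```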

Related Commands

- /create-skill - Created this skill
- /write - Can use model-testing to optimize content generation
- /training - Can use model-testing to optimize workout plan generation

Version: 1.0
Last Updated: 2026-01-22
Status: Active

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.