Install this specific skill from the multi-skill repository:

```bash
npx skills add OmidZamani/dspy-skills --skill "dspy-evaluation-suite"
```
# Description
This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to systematically test and measure DSPy program quality with custom or built-in metrics.
# SKILL.md
```yaml
name: dspy-evaluation-suite
version: "1.0.0"
dspy-compatibility: "3.1.2"
description: This skill should be used when the user asks to "evaluate a DSPy program", "test my DSPy module", "measure performance", "create evaluation metrics", "use answer_exact_match or SemanticF1", mentions "Evaluate class", "comparing programs", "establishing baselines", or needs to systematically test and measure DSPy program quality with custom or built-in metrics.
allowed-tools:
  - Read
  - Write
  - Glob
  - Grep
```
## DSPy Evaluation Suite

### Goal
Systematically evaluate DSPy programs using built-in and custom metrics with parallel execution.
### When to Use
- Measuring program performance before/after optimization
- Comparing different program variants
- Establishing baselines
- Validating production readiness
### Related Skills
- Use with any optimizer: dspy-bootstrap-fewshot, dspy-miprov2-optimizer, dspy-gepa-reflective
- Evaluate RAG pipelines: dspy-rag-pipeline
### Inputs

| Input | Type | Description |
|---|---|---|
| `program` | `dspy.Module` | Program to evaluate |
| `devset` | `list[dspy.Example]` | Evaluation examples (see the sketch below) |
| `metric` | `callable` | Scoring function |
| `num_threads` | `int` | Number of parallel worker threads |
### Outputs

| Output | Type | Description |
|---|---|---|
| `score` | `float` | Average metric score (0-100) |
| `results` | `list` | Per-example `(example, prediction, score)` tuples |
### Workflow

#### Phase 1: Setup Evaluator

```python
from dspy.evaluate import Evaluate

evaluator = Evaluate(
    devset=devset,
    metric=my_metric,
    num_threads=8,
    display_progress=True
)
```
#### Phase 2: Run Evaluation

```python
result = evaluator(my_program)
print(f"Score: {result.score:.2f}%")

# Access individual results: (example, prediction, score) tuples
for example, pred, score in result.results[:3]:
    print(f"Example: {example.question[:50]}... Score: {score}")
```
### Built-in Metrics

#### answer_exact_match

```python
import dspy

# Normalized, case-insensitive comparison
metric = dspy.evaluate.answer_exact_match
```
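A short sketch of plugging the built-in metric into the evaluator; it assumes your examples and predictions both expose an `answer` field:

```python
import dspy
from dspy.evaluate import Evaluate

# answer_exact_match compares pred.answer against example.answer after normalization.
evaluator = Evaluate(
    devset=devset,
    metric=dspy.evaluate.answer_exact_match,
    num_threads=8,
    display_progress=True,
)
result = evaluator(my_program)
```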
#### SemanticF1

LLM-based semantic evaluation:

```python
from dspy.evaluate import SemanticF1

semantic = SemanticF1()
score = semantic(example, prediction)
```
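Because SemanticF1 is itself a DSPy module that calls an LLM judge, an LM must be configured before evaluation. A sketch, with the model name only as an example; check the field names SemanticF1 expects (e.g. `question`/`response` in recent DSPy versions) against your version:

```python
import dspy
from dspy.evaluate import Evaluate, SemanticF1

# Example judge LM; any configured dspy.LM works.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

evaluator = Evaluate(devset=devset, metric=SemanticF1(), num_threads=4, display_progress=True)
result = evaluator(my_program)
```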
### Custom Metrics

#### Basic Metric

```python
def exact_match(example, pred, trace=None):
    """Returns bool, int, or float."""
    return example.answer.lower().strip() == pred.answer.lower().strip()
```
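A common DSPy convention, sketched here as an optional pattern: use the `trace` argument to return a graded score during normal evaluation, but a strict boolean while an optimizer is bootstrapping demonstrations (`trace` is not `None` in that case):

```python
def graded_or_strict(example, pred, trace=None):
    """Graded score for evaluation; strict bool during optimizer bootstrapping."""
    exact = example.answer.lower().strip() == pred.answer.lower().strip()
    partial = example.answer.lower() in pred.answer.lower()
    if trace is None:
        # Normal evaluation: allow partial credit.
        return 1.0 if exact else (0.5 if partial else 0.0)
    # An optimizer is collecting demonstrations: accept only exact matches.
    return exact
```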
#### Multi-Factor Metric

```python
def quality_metric(example, pred, trace=None):
    """Score based on multiple factors."""
    score = 0.0

    # Correctness (50%)
    if example.answer.lower() in pred.answer.lower():
        score += 0.5

    # Conciseness (25%)
    if len(pred.answer.split()) <= 20:
        score += 0.25

    # Has reasoning (25%)
    if hasattr(pred, 'reasoning') and pred.reasoning:
        score += 0.25

    return score
```
#### GEPA-Compatible Metric

```python
def feedback_metric(example, pred, trace=None):
    """Returns (score, feedback) for the GEPA optimizer."""
    correct = example.answer.lower() in pred.answer.lower()
    if correct:
        return 1.0, "Correct answer provided."
    else:
        return 0.0, f"Expected '{example.answer}', got '{pred.answer}'"
```
### Production Example

```python
import dspy
from dspy.evaluate import Evaluate, SemanticF1
import json
import logging
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class EvaluationResult:
    score: float
    num_examples: int
    correct: int
    incorrect: int
    errors: int


def comprehensive_metric(example, pred, trace=None) -> float:
    """Multi-dimensional evaluation metric."""
    scores = []

    # 1. Correctness
    if hasattr(example, 'answer') and hasattr(pred, 'answer'):
        correct = example.answer.lower().strip() in pred.answer.lower().strip()
        scores.append(1.0 if correct else 0.0)

    # 2. Completeness (answer not empty and not an error message)
    if hasattr(pred, 'answer'):
        complete = len(pred.answer.strip()) > 0 and "error" not in pred.answer.lower()
        scores.append(1.0 if complete else 0.0)

    # 3. Reasoning quality (if available)
    if hasattr(pred, 'reasoning'):
        has_reasoning = len(str(pred.reasoning)) > 20
        scores.append(1.0 if has_reasoning else 0.5)

    return sum(scores) / len(scores) if scores else 0.0


class EvaluationSuite:
    def __init__(self, devset, num_threads=8):
        self.devset = devset
        self.num_threads = num_threads

    def evaluate(self, program, metric=None) -> EvaluationResult:
        """Run full evaluation with detailed results."""
        metric = metric or comprehensive_metric
        evaluator = Evaluate(
            devset=self.devset,
            metric=metric,
            num_threads=self.num_threads,
            display_progress=True
        )
        eval_result = evaluator(program)

        # Extract individual scores from results
        scores = [score for example, pred, score in eval_result.results]
        correct = sum(1 for s in scores if s >= 0.5)
        errors = sum(1 for s in scores if s == 0)

        return EvaluationResult(
            score=eval_result.score,
            num_examples=len(self.devset),
            correct=correct,
            incorrect=len(self.devset) - correct - errors,
            errors=errors
        )

    def compare(self, programs: dict, metric=None) -> dict:
        """Compare multiple programs."""
        results = {}
        for name, program in programs.items():
            logger.info(f"Evaluating: {name}")
            results[name] = self.evaluate(program, metric)

        # Rank by score (score is a 0-100 percentage)
        ranked = sorted(results.items(), key=lambda x: x[1].score, reverse=True)
        print("\n=== Comparison Results ===")
        for rank, (name, result) in enumerate(ranked, 1):
            print(f"{rank}. {name}: {result.score:.2f}%")
        return results

    def export_report(self, program, output_path: str, metric=None):
        """Export detailed evaluation report."""
        result = self.evaluate(program, metric)
        report = {
            "summary": {
                "score": result.score,
                "total": result.num_examples,
                "correct": result.correct,
                "accuracy": result.correct / result.num_examples
            },
            "config": {
                "num_threads": self.num_threads,
                "num_examples": len(self.devset)
            }
        }
        with open(output_path, 'w') as f:
            json.dump(report, f, indent=2)
        logger.info(f"Report saved to {output_path}")
        return report


# Usage
suite = EvaluationSuite(devset, num_threads=8)

# Single evaluation
result = suite.evaluate(my_program)
print(f"Score: {result.score:.2f}%")

# Compare variants
results = suite.compare({
    "baseline": baseline_program,
    "optimized": optimized_program,
    "finetuned": finetuned_program
})
```
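The report exporter defined above can be called directly; the output path below is just an example:

```python
# Write a JSON summary of the evaluation to disk.
suite.export_report(my_program, "eval_report.json")
```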
### Best Practices

- **Hold out test data**: never optimize on the evaluation set (see the sketch after this list)
- **Multiple metrics**: combine correctness, quality, and efficiency
- **Statistical significance**: use enough examples (100+)
- **Track over time**: version-control evaluation results
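A minimal sketch of the first practice, assuming `dataset` is a list of `dspy.Example` objects (split sizes are arbitrary):

```python
import random

# Shuffle once with a fixed seed so the splits are reproducible.
random.Random(0).shuffle(dataset)

trainset = dataset[:200]       # seen by the optimizer
devset = dataset[200:300]      # used while iterating on programs and prompts
testset = dataset[300:]        # held out: only for the final Evaluate run
```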
### Limitations
- Metrics are task-specific; no universal measure
- SemanticF1 requires LLM calls (cost)
- Parallel evaluation can hit rate limits
- Edge cases may not be captured
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.