npx skills add miles-knowbl/orchestrator --skill "calibration-tracker"
Install specific skill from multi-skill repository
# Description
Tracks estimate accuracy over time and adjusts future estimates based on historical data. Compares estimated vs actual effort, identifies systematic biases, and generates calibration adjustments. Enables increasingly accurate estimates as more data accumulates.
# SKILL.md
name: calibration-tracker
description: "Tracks estimate accuracy over time and adjusts future estimates based on historical data. Compares estimated vs actual effort, identifies systematic biases, and generates calibration adjustments. Enables increasingly accurate estimates as more data accumulates."
phase: COMPLETE
category: meta
version: "1.0.0"
depends_on: []
tags: [meta, calibration, estimation, metrics]
Calibration Tracker
Improve estimates through feedback.
When to Use
- After journey completion → Record actual vs estimated
- Before estimating → Load calibration adjustments
- Periodically → Analyze patterns across domains
- When estimates are consistently off → Diagnose and adjust
Reference Requirements
MUST read before applying this skill:
| Reference | Why Required |
|---|---|
| calibration-formulas.md | Statistical methods for adjustment |
| variance-analysis.md | Root cause patterns for estimate variance |
Read if applicable:
| Reference | When Needed |
|---|---|
| confidence-levels.md | When interpreting sample size confidence |
Verification: Ensure calibration.json is updated with new data point.
Required Deliverables
| Deliverable | Location | Condition |
|---|---|---|
| calibration.json | domain-memory/{domain}/learning/ | Always (create or update) |
Core Concept
Calibration Tracker answers: "How can we estimate more accurately next time?"
┌─────────────────────────────────────────────────────────────────────────────┐
│                          CALIBRATION FEEDBACK LOOP                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ESTIMATE ─────────────────────────────────────────▶ ACTUAL                │
│      │                                                  │                   │
│      │                                                  │                   │
│      │           ┌─────────────────────────┐            │                   │
│      │           │   Calibration Tracker   │            │                   │
│      │           │                         │            │                   │
│      └──────────▶│  Compare & Analyze      │◀───────────┘                   │
│                  │  Generate Adjustments   │                                │
│                  │  Store History          │                                │
│                  └────────────┬────────────┘                                │
│                               │                                             │
│                               ▼                                             │
│                        FUTURE ESTIMATES                                     │
│                       (with adjustments)                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Calibration Data Model
Historical Record
{
  "domain": "skills-library-mcp",
  "records": [
    {
      "id": "rec-001",
      "system": "Skills Library MCP",
      "date": "2025-01-17",
      "estimated": {
        "complexity": "M",
        "effortHours": 26,
        "durationDays": 2,
        "riskMultiplier": 1.2,
        "confidence": "high",
        "breakdown": {
          "foundation": 4,
          "state": 7.25,
          "memory": 3.5,
          "github": 3.5,
          "polish": 5.5
        }
      },
      "actual": {
        "effortHours": 4.5,
        "durationDays": 0.5,
        "breakdown": {
          "foundation": 0.5,
          "state": 1.5,
          "memory": 0.75,
          "github": 0.5,
          "polish": 1.25
        }
      },
      "ratio": 0.17,
      "factors": {
        "agenticExecution": true,
        "existingPatterns": true,
        "clearRequirements": true,
        "noBlockers": true
      },
      "notes": "Agentic continuous execution far faster than estimated human sprints"
    }
  ]
}
Adjustment Model
{
  "domain": "skills-library-mcp",
  "lastUpdated": "2025-01-17",
  "sampleSize": 1,
  "adjustments": {
    "global": {
      "agenticMultiplier": 0.3,
      "confidence": "low",
      "basedOn": 1
    },
    "byComplexity": {
      "S": { "multiplier": 1.0, "samples": 0 },
      "M": { "multiplier": 0.3, "samples": 1 },
      "L": { "multiplier": 1.0, "samples": 0 },
      "XL": { "multiplier": 1.0, "samples": 0 }
    },
    "byCategory": {
      "mcp": { "multiplier": 0.8, "samples": 1 },
      "typescript": { "multiplier": 0.9, "samples": 1 },
      "fileOperations": { "multiplier": 0.7, "samples": 1 }
    },
    "byPhase": {
      "INIT": { "multiplier": 0.5, "samples": 1 },
      "SCAFFOLD": { "multiplier": 0.3, "samples": 1 },
      "IMPLEMENT": { "multiplier": 0.3, "samples": 1 },
      "TEST": { "multiplier": 0.4, "samples": 1 },
      "VERIFY": { "multiplier": 0.3, "samples": 1 },
      "VALIDATE": { "multiplier": 0.3, "samples": 1 },
      "DOCUMENT": { "multiplier": 0.4, "samples": 1 },
      "REVIEW": { "multiplier": 0.3, "samples": 1 },
      "SHIP": { "multiplier": 0.3, "samples": 1 }
    }
  }
}
The Calibration Process
┌─────────────────────────────────────────────────────────────────────────────┐
│                             CALIBRATION PROCESS                             │
│                                                                             │
│  RECORD PHASE (After journey)                                               │
│  ────────────────────────────                                               │
│                                                                             │
│  1. CAPTURE ACTUALS                                                         │
│     ├── Total hours from journey tracer                                     │
│     ├── Hours by phase                                                      │
│     └── Hours by skill                                                      │
│                                                                             │
│  2. COMPARE TO ESTIMATE                                                     │
│     ├── Overall ratio (actual / estimated)                                  │
│     ├── Phase-level ratios                                                  │
│     └── Identify largest variances                                          │
│                                                                             │
│  3. ANALYZE FACTORS                                                         │
│     ├── What contributed to variance?                                       │
│     ├── Agentic vs human execution?                                         │
│     ├── Clear requirements vs ambiguity?                                    │
│     └── Existing patterns vs novel?                                         │
│                                                                             │
│  4. UPDATE ADJUSTMENTS                                                      │
│     ├── Weighted average with history                                       │
│     ├── Update confidence based on sample size                              │
│     └── Flag anomalies for review                                           │
│                                                                             │
│  APPLY PHASE (Before estimating)                                            │
│  ───────────────────────────────                                            │
│                                                                             │
│  5. LOAD ADJUSTMENTS                                                        │
│     ├── Read domain calibration data                                        │
│     └── Check sample sizes for confidence                                   │
│                                                                             │
│  6. APPLY TO ESTIMATE                                                       │
│     ├── Start with base estimate                                            │
│     ├── Apply relevant multipliers                                          │
│     └── Document adjustments made                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
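Step 4 ("Weighted average with history") can be sketched as follows. This is a minimal illustration, not the skill's actual formula: it assumes an equal-weight running mean over all samples, and `updateAdjustment` is a hypothetical helper name. The authoritative method is the one described in calibration-formulas.md.

```javascript
// Hypothetical helper: fold one new actual/estimated ratio into a stored
// adjustment of the shape { multiplier, samples }.
// Assumption: an equal-weight running mean; the real weighting may differ.
function updateAdjustment(adjustment, newRatio) {
  const samples = adjustment.samples + 1;
  // With zero prior samples, the stored 1.0 default carries no weight.
  const multiplier =
    (adjustment.multiplier * adjustment.samples + newRatio) / samples;
  return { multiplier, samples };
}

// One prior sample at 0.3, then a new journey that came in at 0.5.
const before = { multiplier: 0.3, samples: 1 };
const after = updateAdjustment(before, 0.5);
console.log(after); // multiplier is the mean of 0.3 and 0.5; samples is 2
```

An equal-weight mean keeps every journey's influence the same; a recency-weighted scheme would be a reasonable alternative if older records stop being representative.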
Recording Actuals
Source of Truth: skillsLog
The actual durations come from loop-state.json's skillsLog field. Each skill invocation records:
{
  "skill": "implement",
  "reason": "Implement C1: Record metric events",
  "startedAt": "2025-01-17T22:30:00Z",
  "completedAt": "2025-01-17T23:15:00Z",
  "durationMs": 2700000,
  "status": "complete"
}
Extracting actuals:
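// Sum all skill durations for total actual time.
// Child durations are added on top of the parent's own durationMs, which
// assumes the parent's figure excludes its children (see "What counts").
const totalMs = skillsLog.reduce((sum, entry) => {
  const entryMs = entry.durationMs || 0;
  const childMs = (entry.children || []).reduce((s, c) => s + (c.durationMs || 0), 0);
  return sum + entryMs + childMs;
}, 0);
const totalHours = totalMs / 3600000; // 3,600,000 ms per hour

// Group by skill for per-skill calibration
const bySkill = {};
skillsLog.forEach(entry => {
  // Guard against a missing durationMs so the sum never becomes NaN
  bySkill[entry.skill] = (bySkill[entry.skill] || 0) + (entry.durationMs || 0);
});

// Group by capability (from reason field) for per-capability calibration.
// Assumes capability IDs appear in the reason as "C<number>:", e.g. "C1:".
const byCapability = {};
skillsLog.forEach(entry => {
  const match = /\b(C\d+):/.exec(entry.reason || '');
  if (!match) return; // entry is not tied to a capability
  byCapability[match[1]] = (byCapability[match[1]] || 0) + (entry.durationMs || 0);
});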
What counts:
- Skill execution time only (durationMs)
- Nested child skills are counted separately, not double-counted in parent
- Gate wait time is NOT included
- Human review time is NOT included
After Journey Completion
## Calibration Record: [System Name]
**Date:** [Date]
**Domain:** [Domain]
### Estimate (from ESTIMATE.md)
| Dimension | Value |
|-----------|-------|
| Complexity | [S/M/L/XL] |
| Effort | [X hours] |
| Duration | [Y days] |
| Risk Multiplier | [Z]x |
| Confidence | [High/Medium/Low] |
### Actual (from skillsLog)
| Dimension | Value |
|-----------|-------|
| Effort | [X hours] |
| Duration | [Y days] |
| Rework Cycles | [N] |
### Comparison
| Metric | Estimated | Actual | Ratio |
|--------|-----------|--------|-------|
| Total Hours | 26 | 4.5 | 0.17 |
| INIT Phase | 6 | 1.5 | 0.25 |
| SCAFFOLD Phase | 4 | 1 | 0.25 |
| IMPLEMENT Phase | 8 | 1.5 | 0.19 |
| TEST Phase | 2 | 0.5 | 0.25 |
| VERIFY Phase | 2 | 0.25 | 0.125 |
| VALIDATE Phase | 1 | 0.25 | 0.25 |
| DOCUMENT Phase | 1 | 0.5 | 0.5 |
| REVIEW Phase | 1 | 0.25 | 0.25 |
| SHIP Phase | 1 | 0.25 | 0.25 |
### Per-Skill Comparison
| Skill | Estimated | Actual | Ratio |
|-------|-----------|--------|-------|
| spec | 30m | 3m | 0.10 |
| estimation | 15m | 1m | 0.07 |
| architect | 60m | ? | ? |
| scaffold | 30m | ? | ? |
| implement | 300m | ? | ? |
| test-generation | 120m | ? | ? |
| code-verification | 30m | ? | ? |
### Contributing Factors
- [x] Agentic execution (continuous, no context switching)
- [x] Clear requirements (single system, well-defined)
- [x] Existing patterns (MCP SDK, TypeScript)
- [x] No blockers (no external dependencies)
- [ ] Novel domain (had prior knowledge)
- [ ] Complex integrations (simple file-based)
### Anomalies
- Estimate was for human developer with sprints
- Actual was agentic continuous execution
- Need separate calibration tracks for human vs agentic
### Adjustment Recommendation
| Factor | Current | Recommended | Confidence |
|--------|---------|-------------|------------|
| Agentic Global | N/A | 0.3x | Low (n=1) |
| Medium Complexity | 1.0x | 0.3x | Low (n=1) |
| MCP Category | N/A | 0.8x | Low (n=1) |
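The Comparison table in the record above is a set of actual / estimated ratios. A minimal sketch of deriving those rows and finding the largest variance is shown below. `compareToEstimate` is a hypothetical helper, and it assumes every estimate is non-zero and that at least one phase is supplied.

```javascript
// Hypothetical helper: compute actual/estimated ratios per phase and
// report the phase whose ratio is furthest from 1.0.
// Assumes each estimated value is non-zero and the input is non-empty.
function compareToEstimate(estimated, actual) {
  const rows = Object.keys(estimated).map(phase => ({
    phase,
    estimated: estimated[phase],
    actual: actual[phase] ?? 0,
    ratio: (actual[phase] ?? 0) / estimated[phase],
  }));
  const largestVariance = rows.reduce((worst, row) =>
    Math.abs(row.ratio - 1) > Math.abs(worst.ratio - 1) ? row : worst
  );
  return { rows, largestVariance };
}

// A subset of the phase rows from the Comparison table (hours).
const result = compareToEstimate(
  { INIT: 6, IMPLEMENT: 8, VERIFY: 2, DOCUMENT: 1 },
  { INIT: 1.5, IMPLEMENT: 1.5, VERIFY: 0.25, DOCUMENT: 0.5 }
);
console.log(result.largestVariance.phase); // "VERIFY" (ratio 0.125)
```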
Applying Calibration
Before Estimating
## Calibration Check: [New System]
**Domain:** [Domain]
**Date:** [Date]
### Available Calibration Data
| Factor | Samples | Adjustment | Confidence |
|--------|---------|------------|------------|
| Global (Agentic) | 1 | 0.3x | Low |
| Complexity (M) | 1 | 0.3x | Low |
| Category (MCP) | 1 | 0.8x | Low |
### Raw Estimate
[From estimation skill]
- Base effort: 40 hours
- Risk multiplier: 1.2x
- **Raw total: 48 hours**
### Calibrated Estimate
[Apply adjustments]
- Raw: 48 hours
- Agentic adjustment: × 0.3 = 14.4 hours
- MCP adjustment: × 0.8 = 11.5 hours
- **Calibrated total: 12-15 hours**
### Confidence Note
Sample size is low (n=1). Calibrated estimate has high uncertainty.
Recommend tracking actuals closely to improve calibration.
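The arithmetic in the calibration check above (48 hours, then × 0.3, then × 0.8) can be sketched in code. This is an illustration only: `applyCalibration` and the `{ name, multiplier, samples }` input shape are assumptions, not part of the skill's documented interface.

```javascript
// Hypothetical helper: multiply a raw estimate by each applicable multiplier
// and keep a log of what was applied, so the adjustments can be documented.
function applyCalibration(rawHours, adjustments) {
  let hours = rawHours;
  const applied = [];
  for (const { name, multiplier, samples } of adjustments) {
    if (samples === 0) continue; // no data yet: keep the 1.0x default
    hours *= multiplier;
    applied.push(`${name}: x${multiplier} = ${hours.toFixed(1)} hours`);
  }
  return { hours, applied };
}

// Raw total of 48 hours with the agentic and MCP adjustments from the example.
const { hours, applied } = applyCalibration(48, [
  { name: "Agentic", multiplier: 0.3, samples: 1 },
  { name: "MCP", multiplier: 0.8, samples: 1 },
]);
console.log(applied.join("\n"));
console.log(hours.toFixed(1)); // "11.5"
```

Returning the `applied` log alongside the number covers the "Document adjustments made" step without a second pass over the data.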
Confidence Levels
| Sample Size | Confidence | Action |
|---|---|---|
| 0 | None | Use default multiplier (1.0x) |
| 1-2 | Low | Use with caution, wide range |
| 3-5 | Medium | Apply but verify |
| 6-10 | Good | Reliable for similar contexts |
| 11+ | High | Stable estimate |
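The table maps directly onto a small lookup. The sketch below uses a hypothetical function name, `confidenceFor`, and assumes that a sample size of exactly 10 falls in the "Good" band.

```javascript
// Hypothetical helper: map a sample count to the confidence label in the table.
// Assumption: exactly 10 samples counts as "good"; more than 10 is "high".
function confidenceFor(samples) {
  if (samples <= 0) return "none";   // use the default 1.0x multiplier
  if (samples <= 2) return "low";    // use with caution, wide range
  if (samples <= 5) return "medium"; // apply but verify
  if (samples <= 10) return "good";  // reliable for similar contexts
  return "high";                     // stable estimate
}

console.log(confidenceFor(1)); // "low"
console.log(confidenceFor(7)); // "good"
```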
Variance Analysis
When ratio is significantly off:
Underestimate (Actual > Estimated)
| Cause | Indicator | Fix |
|---|---|---|
| Hidden complexity | Many unknowns discovered | Add discovery phase |
| Scope creep | Requirements changed | Better requirements |
| Integration issues | External dependencies | Add integration buffer |
| Rework | Multiple iterations | Improve first-pass quality |
Overestimate (Actual < Estimated)
| Cause | Indicator | Fix |
|---|---|---|
| Agentic efficiency | Continuous execution | Agentic multiplier |
| Familiar patterns | Reused previous work | Pattern multiplier |
| Clear requirements | No ambiguity | Reduce uncertainty buffer |
| Good tooling | MCP, IDE support | Tool productivity factor |
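Deciding which of the two tables applies comes down to which side of 1.0 the ratio falls on. The sketch below is illustrative: the source does not define "significantly off", so the 25% `tolerance` default is an assumption, as is the function name `classifyVariance`.

```javascript
// Hypothetical helper: classify an actual/estimated ratio.
// Assumption: "significantly off" means more than 25% away from 1.0;
// the source does not specify a threshold.
function classifyVariance(ratio, tolerance = 0.25) {
  if (ratio > 1 + tolerance) return "underestimate"; // actual > estimated
  if (ratio < 1 - tolerance) return "overestimate";  // actual < estimated
  return "on-target";
}

console.log(classifyVariance(0.17)); // "overestimate"
console.log(classifyVariance(1.5));  // "underestimate"
```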
File Locations
| File | Location | Purpose |
|---|---|---|
| Calibration data | domain-memory/{domain}/learning/calibration.json | Historical records |
| Record template | domain-memory/{domain}/learning/calibration-records/ | Individual records |
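A minimal sketch of resolving and loading the calibration file for a domain, using Node's built-in `fs` and `path` modules. The helper names, the `root` parameter, and the empty starting structure (`{ domain, records: [] }`) are assumptions for illustration.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Assumed layout: <root>/domain-memory/<domain>/learning/calibration.json
function calibrationPath(domain, root = ".") {
  return path.join(root, "domain-memory", domain, "learning", "calibration.json");
}

// Load existing calibration data, or start an empty structure if the file
// does not exist yet (the deliverable is "create or update").
function loadCalibration(domain, root = ".") {
  const file = calibrationPath(domain, root);
  if (!fs.existsSync(file)) {
    return { domain, records: [] }; // assumed empty shape
  }
  return JSON.parse(fs.readFileSync(file, "utf8"));
}

console.log(calibrationPath("skills-library-mcp"));
```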
Integration
┌─────────────────────────────────────────────────────────────────────────────┐
│                           CALIBRATION INTEGRATION                           │
│                                                                             │
│  estimation skill                                                           │
│       │                                                                     │
│       ├──▶ Reads calibration.json                                           │
│       │    ├── Applies relevant adjustments                                 │
│       │    └── Documents adjustments in ESTIMATE.md                         │
│       │                                                                     │
│  journey-tracer                                                             │
│       │                                                                     │
│       └──▶ Provides actual hours                                            │
│            └── Total and by-phase breakdown                                 │
│                                                                             │
│  retrospective skill                                                        │
│       │                                                                     │
│       └──▶ Triggers calibration update                                      │
│            └── Calls calibration-tracker with journey data                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Calibration Tracker Verification Checklist
## calibration-tracker Verification
### Record Phase
- [ ] Actual hours captured from journey
- [ ] Comparison to estimate documented
- [ ] Contributing factors identified
- [ ] Anomalies flagged
- [ ] Adjustment recommendations made
### Update Phase
- [ ] calibration.json updated
- [ ] Sample sizes incremented
- [ ] Confidence levels updated
- [ ] Anomalies documented for review
### Apply Phase
- [ ] Calibration data loaded before estimate
- [ ] Relevant adjustments applied
- [ ] Adjustments documented in estimate
- [ ] Confidence level noted
Mode-Specific Behavior
Calibration tracking behavior differs by orchestrator mode:
Greenfield Mode
| Aspect | Behavior |
|---|---|
| Scope | Full system builds from scratch |
| Approach | Track all phases comprehensively |
| Patterns | Establish baseline calibration data |
| Deliverables | Phase timing, capability estimates |
| Validation | Compare estimated vs actual per system |
| Constraints | Minimal; building initial dataset |
Brownfield-Polish Mode
| Aspect | Behavior |
|---|---|
| Scope | Gap closure iterations, polish cycles |
| Approach | Track per-gap-type timing patterns |
| Patterns | Should match existing gap categories |
| Deliverables | Gap type baselines, rework frequency |
| Validation | Compare estimated vs actual per gap |
| Constraints | Must track by gap category |
Polish considerations:
- Track dark mode and responsive design time
- Measure deployment configuration overhead
- Account for test coverage improvement time
- Record rework frequency per gap type
Brownfield-Enterprise Mode
| Aspect | Behavior |
|---|---|
| Scope | Surgical changes, pattern conformance |
| Approach | Track change size vs time ratios |
| Patterns | Must include enterprise overhead factors |
| Deliverables | Enterprise multiplier, review cycle time |
| Validation | Compare change estimate vs actual |
| Constraints | Include approval wait time in estimates |
Enterprise constraints:
- Account for codebase analysis overhead
- Track pattern conformance verification time
- Measure multi-team coordination delays
- Include compliance/security review cycles
Mode-Specific Calibration Data
Store separate calibration tracks per mode:
{
  "domain": "example-domain",
  "byMode": {
    "greenfield": {
      "samples": 3,
      "avgRatio": 0.35,
      "confidence": "medium"
    },
    "brownfield-polish": {
      "samples": 5,
      "avgRatio": 0.45,
      "confidence": "medium"
    },
    "brownfield-enterprise": {
      "samples": 2,
      "avgRatio": 0.60,
      "confidence": "low"
    }
  }
}
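Reading a mode's track from this structure might look like the sketch below. `multiplierForMode` is a hypothetical helper, and treating `avgRatio` directly as the multiplier is an assumption; it falls back to the default 1.0x when a mode has no samples.

```javascript
// Hypothetical helper: pick the calibration track for an orchestrator mode.
// Assumption: avgRatio is used directly as the multiplier.
function multiplierForMode(calibration, mode) {
  const track = calibration.byMode?.[mode];
  if (!track || track.samples === 0) {
    return { multiplier: 1.0, confidence: "none" }; // no data for this mode
  }
  return { multiplier: track.avgRatio, confidence: track.confidence };
}

const calibration = {
  domain: "example-domain",
  byMode: {
    greenfield: { samples: 3, avgRatio: 0.35, confidence: "medium" },
    "brownfield-enterprise": { samples: 2, avgRatio: 0.6, confidence: "low" },
  },
};

console.log(multiplierForMode(calibration, "greenfield").multiplier); // 0.35
console.log(multiplierForMode(calibration, "brownfield-polish").multiplier); // 1
```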
→ See references/calibration-formulas.md for statistical methods
→ See references/variance-analysis.md for root cause patterns
References
| Reference | Description |
|---|---|
| calibration-formulas.md | Statistical methods for calculating calibration adjustments |
| confidence-levels.md | Interpreting sample size and confidence levels |
| variance-analysis.md | Root cause patterns for estimate variance |