calibration-tracker

by @miles-knowbl in Development

# Install this skill:

npx skills add miles-knowbl/orchestrator --skill "calibration-tracker"

Install specific skill from multi-skill repository

# Description

Tracks estimate accuracy over time and adjusts future estimates based on historical data. Compares estimated vs actual effort, identifies systematic biases, and generates calibration adjustments. Enables increasingly accurate estimates as more data accumulates.

# SKILL.md

name: calibration-tracker
description: "Tracks estimate accuracy over time and adjusts future estimates based on historical data. Compares estimated vs actual effort, identifies systematic biases, and generates calibration adjustments. Enables increasingly accurate estimates as more data accumulates."
phase: COMPLETE
category: meta
version: "1.0.0"
depends_on: []
tags: [meta, calibration, estimation, metrics]

Calibration Tracker

Improve estimates through feedback.

When to Use

After journey completion — Record actual vs estimated
Before estimating — Load calibration adjustments
Periodically — Analyze patterns across domains
When estimates consistently off — Diagnose and adjust

Reference Requirements

MUST read before applying this skill:

Reference	Why Required
`calibration-formulas.md`	Statistical methods for adjustment
`variance-analysis.md`	Root cause patterns for estimate variance

Read if applicable:

Reference	When Needed
`confidence-levels.md`	When interpreting sample size confidence

Verification: Ensure calibration.json is updated with new data point.

Required Deliverables

Deliverable	Location	Condition
`calibration.json`	`domain-memory/{domain}/learning/`	Always (create or update)

Core Concept

Calibration Tracker answers: "How can we estimate more accurately next time?"

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CALIBRATION FEEDBACK LOOP                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ESTIMATE ──────────────────────────────────────────▶ ACTUAL                │
│     │                                                    │                  │
│     │                                                    │                  │
│     │          ┌─────────────────────────┐               │                  │
│     │          │   Calibration Tracker   │               │                  │
│     │          │                         │               │                  │
│     └─────────▶│   Compare & Analyze     │◀──────────────┘                  │
│                │   Generate Adjustments  │                                  │
│                │   Store History         │                                  │
│                └────────────┬────────────┘                                  │
│                             │                                               │
│                             ▼                                               │
│                    FUTURE ESTIMATES                                         │
│                    (with adjustments)                                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Calibration Data Model

Historical Record

{
  "domain": "skills-library-mcp",
  "records": [
    {
      "id": "rec-001",
      "system": "Skills Library MCP",
      "date": "2025-01-17",
      "estimated": {
        "complexity": "M",
        "effortHours": 26,
        "durationDays": 2,
        "riskMultiplier": 1.2,
        "confidence": "high",
        "breakdown": {
          "foundation": 4,
          "state": 7.25,
          "memory": 3.5,
          "github": 3.5,
          "polish": 5.5
        }
      },
      "actual": {
        "effortHours": 4.5,
        "durationDays": 0.5,
        "breakdown": {
          "foundation": 0.5,
          "state": 1.5,
          "memory": 0.75,
          "github": 0.5,
          "polish": 1.25
        }
      },
      "ratio": 0.17,
      "factors": {
        "agenticExecution": true,
        "existingPatterns": true,
        "clearRequirements": true,
        "noBlockers": true
      },
      "notes": "Agentic continuous execution far faster than estimated human sprints"
    }
  ]
}

Adjustment Model

{
  "domain": "skills-library-mcp",
  "lastUpdated": "2025-01-17",
  "sampleSize": 1,
  "adjustments": {
    "global": {
      "agenticMultiplier": 0.3,
      "confidence": "low",
      "basedOn": 1
    },
    "byComplexity": {
      "S": { "multiplier": 1.0, "samples": 0 },
      "M": { "multiplier": 0.3, "samples": 1 },
      "L": { "multiplier": 1.0, "samples": 0 },
      "XL": { "multiplier": 1.0, "samples": 0 }
    },
    "byCategory": {
      "mcp": { "multiplier": 0.8, "samples": 1 },
      "typescript": { "multiplier": 0.9, "samples": 1 },
      "fileOperations": { "multiplier": 0.7, "samples": 1 }
    },
    "byPhase": {
      "INIT": { "multiplier": 0.5, "samples": 1 },
      "SCAFFOLD": { "multiplier": 0.3, "samples": 1 },
      "IMPLEMENT": { "multiplier": 0.3, "samples": 1 },
      "TEST": { "multiplier": 0.4, "samples": 1 },
      "VERIFY": { "multiplier": 0.3, "samples": 1 },
      "VALIDATE": { "multiplier": 0.3, "samples": 1 },
      "DOCUMENT": { "multiplier": 0.4, "samples": 1 },
      "REVIEW": { "multiplier": 0.3, "samples": 1 },
      "SHIP": { "multiplier": 0.3, "samples": 1 }
    }
  }
}

The Calibration Process

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CALIBRATION PROCESS                                     │
│                                                                             │
│  RECORD PHASE (After journey)                                               │
│  ────────────────────────────                                               │
│                                                                             │
│  1. CAPTURE ACTUALS                                                         │
│     └─→ Total hours from journey tracer                                     │
│     └─→ Hours by phase                                                      │
│     └─→ Hours by skill                                                      │
│                                                                             │
│  2. COMPARE TO ESTIMATE                                                     │
│     └─→ Overall ratio (actual / estimated)                                  │
│     └─→ Phase-level ratios                                                  │
│     └─→ Identify largest variances                                          │
│                                                                             │
│  3. ANALYZE FACTORS                                                         │
│     └─→ What contributed to variance?                                       │
│     └─→ Agentic vs human execution?                                         │
│     └─→ Clear requirements vs ambiguity?                                    │
│     └─→ Existing patterns vs novel?                                         │
│                                                                             │
│  4. UPDATE ADJUSTMENTS                                                      │
│     └─→ Weighted average with history                                       │
│     └─→ Update confidence based on sample size                              │
│     └─→ Flag anomalies for review                                           │
│                                                                             │
│  APPLY PHASE (Before estimating)                                            │
│  ───────────────────────────────                                            │
│                                                                             │
│  5. LOAD ADJUSTMENTS                                                        │
│     └─→ Read domain calibration data                                        │
│     └─→ Check sample sizes for confidence                                   │
│                                                                             │
│  6. APPLY TO ESTIMATE                                                       │
│     └─→ Start with base estimate                                            │
│     └─→ Apply relevant multipliers                                          │
│     └─→ Document adjustments made                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Recording Actuals

Source of Truth: skillsLog

The actual durations come from loop-state.json's skillsLog field. Each skill invocation records:

{
  "skill": "implement",
  "reason": "Implement C1: Record metric events",
  "startedAt": "2025-01-17T22:30:00Z",
  "completedAt": "2025-01-17T23:15:00Z",
  "durationMs": 2700000,
  "status": "complete"
}

Extracting actuals:

// Sum all skill durations for total actual time
const totalMs = skillsLog.reduce((sum, entry) => {
  const entryMs = entry.durationMs || 0;
  const childMs = (entry.children || []).reduce((s, c) => s + (c.durationMs || 0), 0);
  return sum + entryMs + childMs;
}, 0);
const totalHours = totalMs / 3600000;

// Group by skill for per-skill calibration
const bySkill = {};
skillsLog.forEach(entry => {
  bySkill[entry.skill] = (bySkill[entry.skill] || 0) + entry.durationMs;
});

// Group by capability (from reason field) for per-capability calibration
const byCapability = {};
skillsLog.filter(e => e.reason.includes('C1:') || e.reason.includes('C2:')).forEach(...);

What counts:
- Skill execution time only (durationMs)
- Nested child skills are counted separately, not double-counted in parent
- Gate wait time is NOT included
- Human review time is NOT included

After Journey Completion

## Calibration Record: [System Name]

**Date:** [Date]
**Domain:** [Domain]

### Estimate (from ESTIMATE.md)

| Dimension | Value |
|-----------|-------|
| Complexity | [S/M/L/XL] |
| Effort | [X hours] |
| Duration | [Y days] |
| Risk Multiplier | [Z]x |
| Confidence | [High/Medium/Low] |

### Actual (from skillsLog)

| Dimension | Value |
|-----------|-------|
| Effort | [X hours] |
| Duration | [Y days] |
| Rework Cycles | [N] |

### Comparison

| Metric | Estimated | Actual | Ratio |
|--------|-----------|--------|-------|
| Total Hours | 26 | 4.5 | 0.17 |
| INIT Phase | 6 | 1.5 | 0.25 |
| SCAFFOLD Phase | 4 | 1 | 0.25 |
| IMPLEMENT Phase | 8 | 1.5 | 0.19 |
| TEST Phase | 2 | 0.5 | 0.25 |
| VERIFY Phase | 2 | 0.25 | 0.125 |
| VALIDATE Phase | 1 | 0.25 | 0.25 |
| DOCUMENT Phase | 1 | 0.5 | 0.5 |
| REVIEW Phase | 1 | 0.25 | 0.25 |
| SHIP Phase | 1 | 0.25 | 0.25 |

### Per-Skill Comparison

| Skill | Estimated | Actual | Ratio |
|-------|-----------|--------|-------|
| spec | 30m | 3m | 0.10 |
| estimation | 15m | 1m | 0.07 |
| architect | 60m | ? | ? |
| scaffold | 30m | ? | ? |
| implement | 300m | ? | ? |
| test-generation | 120m | ? | ? |
| code-verification | 30m | ? | ? |

### Contributing Factors

- [x] Agentic execution (continuous, no context switching)
- [x] Clear requirements (single system, well-defined)
- [x] Existing patterns (MCP SDK, TypeScript)
- [x] No blockers (no external dependencies)
- [ ] Novel domain (had prior knowledge)
- [ ] Complex integrations (simple file-based)

### Anomalies

- Estimate was for human developer with sprints
- Actual was agentic continuous execution
- Need separate calibration tracks for human vs agentic

### Adjustment Recommendation

| Factor | Current | Recommended | Confidence |
|--------|---------|-------------|------------|
| Agentic Global | N/A | 0.3x | Low (n=1) |
| Medium Complexity | 1.0x | 0.3x | Low (n=1) |
| MCP Category | N/A | 0.8x | Low (n=1) |

Applying Calibration

Before Estimating

## Calibration Check: [New System]

**Domain:** [Domain]
**Date:** [Date]

### Available Calibration Data

| Factor | Samples | Adjustment | Confidence |
|--------|---------|------------|------------|
| Global (Agentic) | 1 | 0.3x | Low |
| Complexity (M) | 1 | 0.3x | Low |
| Category (MCP) | 1 | 0.8x | Low |

### Raw Estimate

[From estimation skill]
- Base effort: 40 hours
- Risk multiplier: 1.2x
- **Raw total: 48 hours**

### Calibrated Estimate

[Apply adjustments]
- Raw: 48 hours
- Agentic adjustment: × 0.3 = 14.4 hours
- MCP adjustment: × 0.8 = 11.5 hours
- **Calibrated total: 12-15 hours**

### Confidence Note

Sample size is low (n=1). Calibrated estimate has high uncertainty.
Recommend tracking actuals closely to improve calibration.

Confidence Levels

Sample Size	Confidence	Action
0	None	Use default multiplier (1.0x)
1-2	Low	Use with caution, wide range
3-5	Medium	Apply but verify
6-10	Good	Reliable for similar contexts
10+	High	Stable estimate

Variance Analysis

When ratio is significantly off:

Underestimate (Actual > Estimated)

Cause	Indicator	Fix
Hidden complexity	Many unknowns discovered	Add discovery phase
Scope creep	Requirements changed	Better requirements
Integration issues	External dependencies	Add integration buffer
Rework	Multiple iterations	Improve first-pass quality

Overestimate (Actual < Estimated)

Cause	Indicator	Fix
Agentic efficiency	Continuous execution	Agentic multiplier
Familiar patterns	Reused previous work	Pattern multiplier
Clear requirements	No ambiguity	Reduce uncertainty buffer
Good tooling	MCP, IDE support	Tool productivity factor

File Locations

File	Location	Purpose
Calibration data	`domain-memory/{domain}/learning/calibration.json`	Historical records
Record template	`domain-memory/{domain}/learning/calibration-records/`	Individual records

Integration

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CALIBRATION INTEGRATION                                 │
│                                                                             │
│  estimation skill                                                           │
│       │                                                                     │
│       ├──▶ Reads calibration.json                                           │
│       │    └─→ Applies relevant adjustments                                 │
│       │    └─→ Documents adjustments in ESTIMATE.md                         │
│       │                                                                     │
│  journey-tracer                                                             │
│       │                                                                     │
│       └──▶ Provides actual hours                                            │
│            └─→ Total and by-phase breakdown                                 │
│                                                                             │
│  retrospective skill                                                        │
│       │                                                                     │
│       └──▶ Triggers calibration update                                      │
│            └─→ Calls calibration-tracker with journey data                  │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Calibration Tracker Verification Checklist

## calibration-tracker Verification

### Record Phase
- [ ] Actual hours captured from journey
- [ ] Comparison to estimate documented
- [ ] Contributing factors identified
- [ ] Anomalies flagged
- [ ] Adjustment recommendations made

### Update Phase
- [ ] calibration.json updated
- [ ] Sample sizes incremented
- [ ] Confidence levels updated
- [ ] Anomalies documented for review

### Apply Phase
- [ ] Calibration data loaded before estimate
- [ ] Relevant adjustments applied
- [ ] Adjustments documented in estimate
- [ ] Confidence level noted

Mode-Specific Behavior

Calibration tracking behavior differs by orchestrator mode:

Greenfield Mode

Aspect	Behavior
Scope	Full system builds from scratch
Approach	Track all phases comprehensively
Patterns	Establish baseline calibration data
Deliverables	Phase timing, capability estimates
Validation	Compare estimated vs actual per system
Constraints	Minimal—building initial dataset

Brownfield-Polish Mode

Aspect	Behavior
Scope	Gap closure iterations, polish cycles
Approach	Track per-gap-type timing patterns
Patterns	Should match existing gap categories
Deliverables	Gap type baselines, rework frequency
Validation	Compare estimated vs actual per gap
Constraints	Must track by gap category

Polish considerations:
- Track dark mode and responsive design time
- Measure deployment configuration overhead
- Account for test coverage improvement time
- Record rework frequency per gap type

Brownfield-Enterprise Mode

Aspect	Behavior
Scope	Surgical changes, pattern conformance
Approach	Track change size vs time ratios
Patterns	Must include enterprise overhead factors
Deliverables	Enterprise multiplier, review cycle time
Validation	Compare change estimate vs actual
Constraints	Include approval wait time in estimates

Enterprise constraints:
- Account for codebase analysis overhead
- Track pattern conformance verification time
- Measure multi-team coordination delays
- Include compliance/security review cycles

Mode-Specific Calibration Data

Store separate calibration tracks per mode:

{
  "domain": "example-domain",
  "byMode": {
    "greenfield": {
      "samples": 3,
      "avgRatio": 0.35,
      "confidence": "medium"
    },
    "brownfield-polish": {
      "samples": 5,
      "avgRatio": 0.45,
      "confidence": "medium"
    },
    "brownfield-enterprise": {
      "samples": 2,
      "avgRatio": 0.60,
      "confidence": "low"
    }
  }
}

→ See references/calibration-formulas.md for statistical methods
→ See references/variance-analysis.md for root cause patterns

References

Reference	Description
`calibration-formulas.md`	Statistical methods for calculating calibration adjustments
`confidence-levels.md`	Interpreting sample size and confidence levels
`variance-analysis.md`	Root cause patterns for estimate variance

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.