npx skills add akrimm702/autoforge
Or install specific skill: npx add-skill https://github.com/akrimm702/autoforge
# Description
Autonomous iterative optimization for code, skills, prompts, and repositories. Top-agent orchestrates with mathematical convergence. Four modes: prompt (mental simulation), code (real execution + tests), audit (CLI testing + doc fixing), project (whole-repo optimization). Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".
# SKILL.md
name: autoforge
description: 'Autonomous iterative optimization for code, skills, prompts, and repositories. Top-agent orchestrates with mathematical convergence. Four modes: prompt (mental simulation), code (real execution + tests), audit (CLI testing + doc fixing), project (whole-repo optimization). Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".'
AutoForge — Top-Agent Architecture
Overview
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
Sub-Agent = You
"Sub-Agent" is a conceptual role, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.
For code execution (mode: code), use exec directly.
Multi-Model Setup (recommended for Deep Audits)
For complex audits, you can split two roles across different models:
| Role | Model | Task |
|---|---|---|
| Optimizer | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| Validator | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |
Flow: Optimizer and Validator alternate. Optimizer iterations have status improved/retained/discard. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with sessions_spawn and explicit model.
When to use Multi-Model: Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.
When Single-Model suffices: Simple CLI audits, prompt optimization, code with clear tests.
Configuration
AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.
| Variable | Default | Description |
|---|---|---|
| `AF_CHANNEL` | `telegram` | Messaging channel for reports |
| `AF_CHAT_ID` | (none) | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | (none) | Thread/topic ID within the chat |
Hard Invariants
These rules apply always, regardless of mode:
- **TSV is mandatory.** Every iteration writes exactly one row to `results/[target]-results.tsv`.
- **Reporting is mandatory.** Call `report.sh` immediately after every TSV row.
- **`--dry-run` never overwrites the target.** Only TSV, `*-proposed.md`, and reports are written.
- **Mode isolation is strict.** Only execute steps for the assigned mode.
- **Iteration 1 = Baseline.** Evaluate the original version unchanged, status `baseline`.
Modes — Read ONLY Your Mode!
You are assigned ONE mode. Ignore all sections for other modes.
| Mode | What happens | Output |
|---|---|---|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Execute code in sandbox, measure tests | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |
Your mode is in the task prompt. Everything else is irrelevant to you.
TSV Format (same for ALL modes)
Header (once at loop start):
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
Row per iteration:
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
Use `printf`, not `echo -e`! `echo -e` interprets backslashes in field values; `printf '%s'` outputs strings literally.
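The difference is easy to demonstrate. `echo -e` rewrites backslash sequences inside a field value, while `printf '%s'` emits the bytes exactly as stored (the sample value is illustrative):

```shell
# A field value that happens to contain backslash sequences.
field='C:\temp\notes and a literal \n'

# echo -e interprets \t and \n, silently corrupting the TSV row:
echo -e "$field"

# printf '%s' writes the value exactly as stored:
printf '%s\n' "$field"
```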
5 columns, TAB-separated, EXACTLY this order:
| # | Column | Type | Rules |
|---|---|---|---|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + `%`: `58%`, `92%`, `100%`. Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: `baseline` · `improved` · `retained` · `discard` |
Escaping rules:
- Tabs in text fields → replace with spaces
- Newlines in text fields → replace with `|`
- Empty fields → use hyphen `-` (never leave empty)
- `$` and backticks → use `printf '%s'` or escape with `\$` (shell expansion risk!)
- Unicode/Emoji allowed, count as 1 character (not bytes)
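A small helper can enforce these rules before a row is written; a bash sketch (the function name and the 100-char default are illustrative, not part of the spec):

```shell
# Sanitize a free-text TSV field: tabs -> spaces, newlines -> |,
# enforce the length cap, never emit an empty field.
sanitize_field() {
  local v max="${2:-100}"
  v=$(printf '%s' "$1" | tr '\t' ' ' | tr '\n' '|')
  v=${v:0:max}              # bash counts characters here in UTF-8 locales
  [ -z "$v" ] && v='-'
  printf '%s' "$v"
}

sanitize_field "$(printf 'fix tabs\there\nand newlines')"
```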
Status rules (based on pass-rate comparison):
- `baseline` — Mandatory for Iteration 1. Evaluate original version only.
- `improved` — Pass rate higher than previous best → new version becomes current state
- `retained` — Pass rate equal or marginally better → predecessor remains
- `discard` — Pass rate lower → change discarded, revert to best state
Reporting (same for ALL modes)
After EVERY TSV row (including baseline):
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
After loop ends, additionally with --final:
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
The report script reads AF_CHANNEL, AF_CHAT_ID, and AF_TOPIC_ID from environment. Without them, it prints to stdout with ANSI colors.
Stop Conditions (for ALL modes)
Priority — first matching condition wins, top to bottom:
- 🛑 Minimum iterations — If specified in task (e.g. "min 5"), this count MUST be reached. No other condition can stop before.
- 🛑 Max 30 iterations — Hard safety net, stop immediately.
- ❌ 3× `discard` in a row → structural problem, stop + analyze.
- ✅ 3× 100% pass rate (after minimum) → confirmed perfect, done.
- ➡️ 5× `retained` in a row → converged, done.
Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = consecutive (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.
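Under these counting rules, the streak logic reduces to a single awk pass over the TSV. A sketch (the function name is illustrative; the minimum-iteration floor and the 30-iteration cap are omitted for brevity):

```shell
# Decide whether the loop should stop, from the TSV alone.
check_stop() {
  awk -F'\t' 'NR > 1 {
    if ($3 == "100%") perfect++                          # not necessarily consecutive
    if ($5 == "retained")      { ret++; disc = 0 }
    else if ($5 == "discard")  { disc++; ret = 0 }
    else if ($5 == "improved") { ret = 0; disc = 0 }     # improved breaks both streaks
    # baseline rows count toward no series
  } END {
    if (disc >= 3)         print "stop: 3x discard"
    else if (perfect >= 3) print "stop: 3x 100%"
    else if (ret >= 5)     print "stop: 5x retained"
    else                   print "continue"
  }' "$1"
}
```

A real run would check the minimum-iteration floor and the hard cap before these streak rules, matching the priority order above.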
At 100% in early iterations: Keep going! Test harder edge cases. Only 3× 100% after the minimum confirms true perfection.
Recognizing Validator Noise
In multi-model setups, the Validator can produce false positives — fails that aren't real issues:
- Config path vs tool name confusion (e.g. `agents.list[]` ≠ `agents_list` tool)
- Inverted checks ("no X" → Validator looks for X as required)
- Normal English as forbidden reference (e.g. "runtime outcome" ≠ `runtime: "acp"`)
- Overcounting (thread commands counted as subagent commands)
Rule: If after all real fixes >3 discards come in a row and the fail justifications don't hold up under scrutiny → declare convergence, don't validate endlessly.
Execution Modes
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |
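The `--live` auto-backup can be as simple as copying the target into `results/backups/` before the first overwrite. A sketch (the function name and demo file are illustrative):

```shell
# Back up the target into results/backups/ before --live overwrites it.
backup_target() {
  local target="$1" dest="results/backups"
  mkdir -p "$dest"
  cp -a "$target" "$dest/$(basename "$target").bak"
}

# Demo: protect a scratch file before overwriting it in place.
cd "$(mktemp -d)"
echo 'original content' > target.md
backup_target target.md        # creates results/backups/target.md.bak
```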
mode: prompt
Only read if your task contains `mode: prompt`!
Per Iteration: What you do
1. Read current prompt/skill
2. Mentally simulate 5 different realistic scenarios
3. Evaluate each scenario against all evals (Yes=1, No=0)
4. Pass rate = (Sum Yes) / (Eval count × 5 scenarios) × 100
5. Compare with best previous pass rate → determine status
6. On `improved`: propose minimal, surgical improvement
7. Write TSV row + call report.sh
8. Check stop conditions
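The pass-rate formula above is plain integer arithmetic; for example, with 4 evals and 5 scenarios (the numbers are illustrative):

```shell
# 13 "Yes" answers out of 4 evals x 5 scenarios = 20 checks.
yes_count=13; eval_count=4; scenarios=5
pass_rate=$(( yes_count * 100 / (eval_count * scenarios) ))
printf '%d%%\n' "$pass_rate"   # prints 65%
```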
At the End
Best version → results/[target]-proposed.md + report.sh --final
mode: code
Only read if your task contains `mode: code`!
Per Iteration: What you do
1. Create sandbox: `SCRATCH=$(mktemp -d) && cd $SCRATCH`
2. Write current code to sandbox
3. Execute test command (with `timeout 60s`)
4. Measure: exit_code, stdout, stderr, runtime
5. Evaluate against evals → calculate pass rate
6. On `improved`: minimal code improvement + verify again
7. Write TSV row + call report.sh
8. Check stop conditions
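The sandbox-and-measure steps can be sketched as one helper that runs the test command in a scratch directory and captures every metric (the function name is illustrative; the 60s limit matches the timeout above):

```shell
# Run a test command in a fresh sandbox; capture exit code, output, runtime.
run_measured() {
  local scratch start end
  scratch=$(mktemp -d)
  start=$(date +%s)
  ( cd "$scratch" && timeout 60 "$@" ) > "$scratch/stdout" 2> "$scratch/stderr"
  exit_code=$?
  end=$(date +%s)
  runtime_s=$(( end - start ))
  stdout=$(cat "$scratch/stdout")
  stderr=$(cat "$scratch/stderr")
}

run_measured echo 'SUCCESS'
printf 'exit=%d runtime=%ds stdout=%s\n' "$exit_code" "$runtime_s" "$stdout"
```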
Code Eval Types
| Eval Type | Description | Example |
|---|---|---|
| `exit_code` | Process exit code | `exit_code == 0` |
| `output_contains` | stdout contains string | `"SUCCESS" in stdout` |
| `output_matches` | stdout matches regex | `r"Total: \d+"` |
| `test_pass` | Test framework green | pytest exit 0 |
| `runtime` | Runtime limit | < 5000ms |
| `no_stderr` | No error output | `stderr == ""` |
| `file_exists` | Output file created | `result.json` exists |
| `json_valid` | Output is valid JSON | `json.loads(stdout)` |
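Several of these eval types reduce to one-line shell checks against the captured measurements; a sketch (the sample values are illustrative):

```shell
# Score four eval types against captured measurements.
exit_code=0
stdout='Total: 42 SUCCESS'
runtime_ms=1200

pass=0; total=4
[ "$exit_code" -eq 0 ]                            && pass=$((pass+1))  # exit_code
case "$stdout" in *SUCCESS*) pass=$((pass+1));; esac                   # output_contains
printf '%s' "$stdout" | grep -Eq 'Total: [0-9]+'  && pass=$((pass+1))  # output_matches
[ "$runtime_ms" -lt 5000 ]                        && pass=$((pass+1))  # runtime < 5000ms
printf '%d%%\n' $(( pass * 100 / total ))
```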
At the End
Best code → results/[target]-proposed.[ext] + report.sh --final
mode: audit
Only read if your task contains `mode: audit`!
⚠️ DO NOT write or execute your own code. Only test CLI commands of the target tool (--help + read-only).
Two Variants
Simple Audit (CLI skill, clear commands):
- 2 iterations: Baseline → Proposed Fix
- For tools with clear --help output and simple command structure
Deep Audit (complex docs, many checks):
- Iterative loop like prompt/code, same stop conditions
- For extensive documentation with many checkpoints (e.g. config keys, tool policy, parameter lists)
- Recommended: Multi-Model setup (Opus Optimizer + external Validator)
Simple Audit Flow
1. Write TSV header
2. Iteration 1 (Baseline): Test every documented command → pass rate → TSV + report
3. Iteration 2 (Proposed Fix): Write improved SKILL.md → expected pass rate → TSV + report
4. Improved SKILL.md → `results/[target]-proposed.md`
5. Detail results → `results/[target]-audit-details.md` (NOT in TSV!)
6. `report.sh --final`
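Iteration 1's command testing can be sketched as a loop that runs each documented invocation read-only and counts successes. The commands below are stand-ins (GNU coreutils assumed); a real audit would feed the list extracted from SKILL.md:

```shell
# Probe documented commands read-only and compute the audit pass rate.
audit_pass_rate() {
  local pass=0 total=0 cmd
  while IFS= read -r cmd; do
    [ -z "$cmd" ] && continue
    total=$((total+1))
    $cmd > /dev/null 2>&1 && pass=$((pass+1))   # does the documented call work?
  done
  printf '%d%%\n' $(( pass * 100 / total ))
}

printf '%s\n' 'ls --help' 'cat --help' | audit_pass_rate
```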
Deep Audit Flow
1. Write TSV header
2. Iteration 1 (Baseline): Extract ground truth from source, define all checks, evaluate baseline
3. Iterations 2+: Optimizer fixes issues → Validator checks → TSV + report per iteration
4. Loop runs until stop conditions trigger (3× 100%, 5× retained, 3× discard)
5. Final version → `results/[target]-proposed.md` or `results/[target]-v1.md`
6. `report.sh --final`
Fixed Evals (audit)
- Completeness — Does SKILL.md cover ≥80% of real commands/config?
- Correctness — Are ≥90% of documented commands/params syntactically correct?
- No stale references — Does everything documented actually exist?
- No missing core features — Are all important features covered?
- Workflow quality — Does quick-start actually work?
mode: project
Only read if your task contains `mode: project`!
⚠️ This mode operates on an ENTIRE repository/directory, not a single file. Cross-file consistency is the core feature — this is NOT "audit on many files."
Three Phases
Project mode runs through three sequential phases. Phases 1 and 2 happen once (in Iteration 1 = Baseline). Phase 3 is the iterative fix loop.
Phase 1: Scan & Plan
- Analyze the repo directory:

  ```bash
  # Discover structure
  tree -L 3 --dirsfirst [target_dir]
  ls -la [target_dir]
  ```

- Identify relevant files and classify by priority:
| Priority | Files |
|---|---|
| critical | README, Dockerfile, CI workflows (.github/workflows), package.json/requirements.txt, main entry points |
| normal | Tests, configs, scripts, .env.example, .gitignore |
| low | Docs, examples, LICENSE, CHANGELOG |
- Build the File-Map — a mental inventory of what exists and what's missing.
- Compose eval set: Merge user-provided evals with auto-detected evals (see Default Evals below).
Phase 2: Cross-File Analysis
Run consistency checks across files. Each check = one eval point:
| Check | What it verifies |
|---|---|
| README ↔ CLI | Documented commands/flags match actual --help output |
| Dockerfile ↔ deps | requirements.txt / package.json versions match what Dockerfile installs |
| CI ↔ project structure | Workflow references correct paths, scripts, test commands |
| `.env.example` ↔ code | Every env var in code has a corresponding entry in `.env.example` |
| Imports ↔ dependencies | Every `import` / `require` has a matching dependency declaration |
| Tests ↔ source | Test files exist for critical modules |
| `.gitignore` ↔ artifacts | Build outputs, secrets, and caches are excluded |
Result of Phase 2: A complete eval checklist with per-file and cross-file checks, each scored Yes/No.
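As one concrete instance, the `.env.example` ↔ code check from the table can be sketched with grep. The pattern and demo layout are assumptions; real code may reference env vars in ways a simple regex misses:

```shell
# List env vars referenced in sources that .env.example does not cover.
check_env_coverage() {
  local src="$1" envfile="$2" var missing=0
  for var in $(grep -rhoE '\$\{?[A-Z][A-Z0-9_]+' "$src" | tr -d '${' | sort -u); do
    grep -q "^${var}=" "$envfile" || { printf 'missing: %s\n' "$var"; missing=$((missing+1)); }
  done
  return "$missing"   # one failing eval point per uncovered var
}

# Demo repo with one covered and one uncovered variable.
cd "$(mktemp -d)"
mkdir src
printf 'curl -H "key: $API_KEY" "$DB_URL"\n' > src/run.sh
printf 'API_KEY=\n' > .env.example
check_env_coverage src .env.example || true   # reports DB_URL as missing
```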
Phase 3: Iterative Fix Loop
Same loop logic as prompt/code/audit — TSV, report.sh, stop conditions. Key differences:
- Multiple files can be changed per iteration
- Pass rate = aggregated over ALL evals (file-specific + cross-file)
- Fixes are minimal and surgical — don't refactor blindly, only fix what improves pass rate
- `change_description` includes which files were touched: `"Fix Dockerfile + CI workflow sync"`
Per Iteration: What you do
1. Evaluate current repo state against all evals (file-specific + cross-file)
2. Calculate pass rate: (passing evals / total evals) × 100
3. Compare with best previous pass rate → determine status
4. On `improved`: apply minimal, surgical fixes to the fewest files necessary
5. Verify the fix didn't break other evals (re-run affected checks)
6. Write TSV row + call report.sh
7. Check stop conditions
Dry-Run vs Live
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Fixed files → `results/[target]-proposed/` directory (mirrors repo structure). Original repo untouched. |
| `--live` | Files overwritten in-place. Originals backed up → `results/backups/` (preserving directory structure). |
Default Evals (auto-applied unless overridden)
These evals are automatically used when the user doesn't provide custom evals. The agent detects which are applicable based on what exists in the repo:
| # | Eval | Condition |
|---|---|---|
| 1 | README accurate? (describes actual features/commands) | README exists |
| 2 | Tests present and green? (pytest / npm test / go test) | Test files or test config detected |
| 3 | CI configured and syntactically correct? | `.github/workflows/` or `.gitlab-ci.yml` exists |
| 4 | No hardcoded secrets? (`grep -rE "(password\|api_key\|token\|secret)\s*="`) | Always |
| 5 | Dependencies complete? (requirements.txt ↔ imports, package.json ↔ requires) | Dependency file exists |
| 6 | Dockerfile functional? (`docker build` succeeds or Dockerfile syntax valid) | Dockerfile exists |
| 7 | `.gitignore` sensible? (no secrets, build artifacts excluded) | `.gitignore` exists |
| 8 | License present? | Always |
Eval Scoring
Pass Rate = (Passing Evals / Total Applicable Evals) × 100
Evals that don't apply (e.g. "Dockerfile functional?" when no Dockerfile exists) are excluded from the total, not counted as passes.
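Scoring with not-applicable evals excluded can be sketched as an awk filter. The `name<TAB>pass|fail|na` input format here is an assumption for illustration, not part of the spec:

```shell
# Pass rate over applicable evals only; "na" rows drop out of the denominator.
score_evals() {
  awk -F'\t' '
    $2 == "pass" { pass++; total++ }
    $2 == "fail" { total++ }
    # na rows (e.g. "Dockerfile functional?" with no Dockerfile) are skipped
    END { printf "%d%%\n", (total ? pass * 100 / total : 0) }
  '
}

printf 'readme\tpass\ntests\tfail\ndockerfile\tna\nlicense\tpass\n' | score_evals
# 2 of 3 applicable evals pass -> 66%
```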
At the End
- `--dry-run`: All proposed changes → `results/[target]-proposed/` directory
- `--live`: Changes already applied, backups in `results/backups/`
- `report.sh --final`
- Optionally: `results/[target]-project-details.md` with per-file findings (NOT in TSV!)
Directory Structure
autoforge/
├── SKILL.md ← This file
├── results/
│ ├── [target]-results.tsv ← TSV logs
│ ├── [target]-proposed.md ← Proposed improvement (prompt/audit)
│ ├── [target]-proposed/ ← Proposed repo changes (project mode)
│ │ ├── README.md
│ │ ├── Dockerfile
│ │ └── ...
│ ├── [target]-v1.md ← Deep audit final version
│ ├── [target]-audit-details.md ← Audit details (audit mode only)
│ ├── [target]-project-details.md ← Project details (project mode only)
│ └── backups/ ← Auto-backups (--live)
│ ├── [file].bak ← Single file backups (prompt/code/audit)
│ └── [target]-backup/ ← Full directory backup (project mode)
├── scripts/
│ ├── report.sh ← Channel reporting
│ └── visualize.py ← PNG chart (optional)
├── references/
│ ├── eval-examples.md ← Pre-built evals
│ └── ml-mode.md ← ML training guide
└── examples/
├── demo-results.tsv ← Demo data
└── example-config.json ← Example configuration
Examples (task descriptions, NOT CLI commands)
AutoForge is not a CLI tool — it's a skill prompt for the agent:
# Optimize a prompt
"Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY correct? Workspace protected? Clearly structured?"
# Audit a CLI skill (simple)
"Start autoforge mode: audit for notebooklm-py."
# Deep audit with multi-model
"Start autoforge mode: audit (deep) for subagents docs.
Optimizer: Opus, Validator: GPT-5
Extract ground truth from source, validate iteratively."
# Optimize code
"Start autoforge mode: code for backup.sh.
File: ./backup.sh
Test: bash backup.sh personal --dry-run
Evals: exit_code==0, backup file created, < 10s runtime"
# Optimize a whole repository
"Start autoforge mode: project for ./my-app
Evals: Tests green? CI correct? No hardcoded secrets? README accurate?"
# Project mode with custom focus
"Start autoforge mode: project for /path/to/api-server
Focus: Docker + CI pipeline consistency
Evals: docker build succeeds, CI workflow references correct paths,
.env.example covers all env vars used in code"
# Project mode dry-run (default)
"Start autoforge mode: project for ./my-tool --dry-run
Use default evals. Show me what needs fixing."
Eval Examples → Mode Mapping
references/eval-examples.md provides ready-to-use Yes/No evals grouped by category. Here's how they map to AutoForge modes:
| eval-examples.md Category | AutoForge Mode | Notes |
|---|---|---|
| Briefing, Email, Calendar, Summary, Proposal | prompt |
Mental simulation with scenario evals |
| Python Script, Shell Script, API, Data Pipeline, Build | code |
Real execution with measurable criteria |
| CI/CD, Docker, Helm, Kubernetes, Terraform | code or project |
code for single files, project for cross-file |
| Code Review, API Documentation | audit |
Verify docs match reality |
| Project / Repository, Cross-File Consistency, Security Baseline | project |
Whole-repo scanning and cross-file checks |
Pick evals from the matching category and paste them into your task prompt as the eval set.
Tips
- Always start with `--dry-run`
- `prompt` = think, `code` = execute, `audit` = test CLI, `project` = optimize repo
- Simple Audit for clear CLI skills, Deep Audit for complex docs
- Project mode scans the whole repo — cross-file consistency is the killer feature
- Multi-Model for Deep Audits: different models cover different blind spots
- At >3 discards after all fixes: check for validator noise, declare convergence if justified
- TSV + report.sh are NOT optional — they are the user interface
- For ML training: see `references/ml-mode.md`
# README.md
Most "self-improving agent" approaches boil down to "reflect on your output."
That's a vibe check, not optimization. AutoForge is different.
| Typical "Reflect" | AutoForge | |
|---|---|---|
| When to stop | "Looks good to me" | 3× 100% pass, 5× retained, or 3× discard |
| Progress | Chat history | TSV with pass rates & diffs per iteration |
| Validation | Same model checks itself | Multi-model cross-validation |
| Reporting | Final summary | Live Unicode bars after every iteration |
| Modes | One generic loop | 4 specialized modes |
| Track record | Demo | 50+ iterations across 6 production skills |
🏗️ Architecture
Core Loop
┌─────────────────────┐
│ Define target + │
│ evals │
└─────────┬───────────┘
│
▼
┌─────────────────────┐
│ Baseline scan │
└─────────┬───────────┘
│
┌──────────────┘
│
▼
┌───────────────┐ proposed ┌───────────────┐
┌──▶│ Optimizer │────────────────▶│ Validator │
│ │ (Claude Opus)│ │ (GPT-5) │
│ └───────────────┘ └───────┬───────┘
│ │
│ ┌───────────────┴───────────────┐
│ │ │
│ ▼ ▼
│ ┌─────────────┐ ┌─────────────┐
│ │ ✅ improved │ │ ❌ discarded │
│ └──────┬──────┘ └──────┬──────┘
│ │ │
│ └───────────┬─────────────┬─────┘
│ ▼ │
│ ┌─────────────┐ │
│ │ Log to TSV │ │
│ │ + report.sh │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ No │ Converged? │ │
└──────────────────────────────┤ │ │
└──────┬───────┘ │
│ │
┌───────────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌──────────────┐ ┌───────────┐
│ 3× 100% ✅ │ │ 5× retained │ │ 3× discard│
│ Deploy! │ │ Converged │ │ Stop ⚠️ │
└─────────────┘ └──────────────┘ └───────────┘
Multi-Model Validation
Iter 1 (Optimizer) Iter 2 (Validator) Iter 3 (Optimizer)
───────────────── ────────────────── ─────────────────
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Claude Opus │ │ GPT-5 │ │ Claude Opus │
│ │ │ │ │ │
│ Analyze │ │ Blind review │ │ Fix findings │
│ Find issues │ │ of output │ │ from GPT-5 │
│ Write fixes │ │ (no context) │ │ │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
pass_rate: 62% pass_rate: 78% pass_rate: 95%
status: improved status: improved status: improved
└────── TSV ──────────────── TSV ──────────────── TSV ──────┘
Different model validates → no "grading your own homework" blind spot
Project Mode — Three Phases
Phase 1 Phase 2 Phase 3
SCAN & PLAN CROSS-FILE ANALYSIS ITERATIVE FIX LOOP
───────────── ─────────────────── ──────────────────
┌──────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Walk repo │ │ README ↔ CLI │ ┌───▶│ Surgical fix │
│ tree │ │ Dockerfile ↔ deps│ │ │ across files │
│ │──────────▶│ CI ↔ scripts │───▶│ └──────┬───────┘
│ Build file │ │ .env ↔ code refs │ │ │
│ priority map │ │ imports ↔ reqs │ │ ▼
└──────────────┘ │ .gitignore ↔ out │ │ ┌──────────────┐
└──────────────────┘ │ │ Validate │
│ │ consistency │
│ └──────┬───────┘
│ │
│ Not │ Done?
└─── yet ◀──┘
🚀 Quick Start
Install
```bash
# Via ClawHub
clawhub install autoforge

# Or clone
git clone https://github.com/akrimm702/autoforge.git
cp -r autoforge ~/.openclaw/workspace/skills/autoforge
```
Configure reporting (optional)
```bash
export AF_CHANNEL="telegram"       # telegram | discord | slack
export AF_CHAT_ID="-100XXXXXXXXXX" # chat/group ID
export AF_TOPIC_ID="1234"          # thread ID (optional)
```
No env vars? Reports print to stdout with ANSI colors.
Tell your agent
Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY handling correct? Workspace protection enforced? Clear structure?
The agent reads the skill, runs the loop, tracks everything in TSV, reports live, and stops when convergence math says it's done.
🔧 Four Modes
prompt — Mental Simulation
Simulates 5 realistic scenarios per iteration, evaluates Yes/No against defined evals, calculates pass rate mathematically. No code execution.
Best for: SKILL.md files, prompt engineering, documentation, briefing templates.
code — Real Execution
Runs code in a sandbox, measures exit codes, stdout, stderr, runtime. Evaluates against concrete test criteria.
Best for: Shell scripts, Python tools, data pipelines, build systems.
audit — CLI Testing
Tests documented commands against actual CLI behavior (--help, read-only). Catches docs-vs-reality drift. Two variants: Simple (2 iterations) or Deep (iterative with multi-model).
Best for: Verifying skill documentation matches real CLI behavior.
project — Whole Repository ⭐
Scans an entire repo, builds a file-map with priorities, runs cross-file consistency checks, and iteratively fixes issues across multiple files.
Best for: README ↔ CLI drift, Dockerfile ↔ dependency mismatches, CI ↔ project structure gaps.
Cross-file checks include:
- README documents what the CLI actually does
- Dockerfile installs the right dependency versions
- CI workflows reference correct paths and scripts
- `.env.example` covers all env vars used in code
- Every import has a matching dependency declaration
- `.gitignore` excludes build artifacts and secrets
📊 Live Reporting
After each iteration, report.sh sends live updates:
📊 AutoForge: coding-agent
📍 Iter 1 █████████░░░░░░░░░░░ 45%
✅ Iter 2 ████████████░░░░░░░░ 62%
✅ Iter 3 ███████████████░░░░░ 78%
✅ Iter 4 █████████████████░░░ 85%
✅ Iter 5 ██████████████████░░ 90%
✅ Iter 6 ██████████████████░░ 92%
✅ Iter 7 ██████████████████░░ 92%
✅ Iter 8 ███████████████████░ 95%
✅ Iter 9 ███████████████████░ 95%
✅ Iter 10 ████████████████████ 100%
──────────────────────
Iterations: 10 ✅ Keep: 10 ❌ Discard: 0
🏆 Best pass rate: 100% (Iter 10)
✅ Loop converged — improvement found
Every iteration is tracked in TSV:
iteration prompt_version_summary pass_rate change_description status
1 Baseline 45% Original SKILL.md baseline baseline
2 Add missing subcommands 62% 16 Codex subcommands added improved
3 Fix approval flags 78% Scoped flags by context improved
...
10 Validation pass 100% All checks green improved
📐 Convergence Rules
No vibes. No "looks good." Mathematical stop conditions:
| Condition | Rule | Purpose |
|---|---|---|
| ⬇️ Minimum iters | Must reach N before any stop | Prevents premature convergence |
| 🛑 Max 30 iters | Hard safety cap | Cost protection |
| ❌ 3× discard streak | Stop + analyze | Detects structural problems |
| ✅ 3× 100% pass | Confirmed perfect | After minimum reached |
| ➡️ 5× retained streak | Fully converged | No further improvement possible |
Validator noise detection: In multi-model setups, validators can produce false positives. AutoForge recognizes config/path confusion, inverted checks, normal English flagged as forbidden references, and over-counting. After all real fixes, if >3 discards stem from non-reproducible complaints → declare convergence.
🔀 Multi-Model Cross-Validation
For complex audits, split optimizer and validator across different models:
| Role | Example Models | Task |
|---|---|---|
| Optimizer | Claude Opus, GPT-4.1 | Finds issues, writes fixes |
| Validator | GPT-5, Gemini | Checks against ground truth independently |
The validator doesn't see the optimizer's reasoning — just the output. This prevents the "same model validates its own work" blind spot.
🏆 Real-World Results
Production runs, not demos.
coding-agent SKILL.md — 553 lines rewritten across 10 iterations. 16 Codex subcommands + 40 Claude CLI flags documented. 45% → 100%. Discovered --yolo was never a real flag.
ACP Router — 90% → 100% in 9 iterations. Agent coverage doubled from 6 to 12 harnesses. Thread spawn recovery policy written from scratch.
Sub-Agents Documentation — 70% → 100% in 14 iterations with multi-model validation. 6 real bugs found in upstream docs. Identified 4 categories of validator false positives.
backup.sh — Added rsync support, validation checks, restore-test. Code mode with real execution. 3 iterations to stable, 2 more to polish.
AutoForge on itself 🤯 — Self-forged in project mode: 67% → 100% in 8 iterations. Fixed 2 script bugs, cleaned config/doc inconsistencies across 7 files.
🗂️ Directory Structure
autoforge/
├── SKILL.md ← OpenClaw skill definition
├── README.md ← You are here
├── LICENSE ← MIT
├── .gitignore
├── scripts/
│ ├── report.sh ← Live reporting (channel or stdout)
│ └── visualize.py ← PNG progress chart generator
├── references/
│ ├── eval-examples.md ← 200+ pre-built evals by category
│ └── ml-mode.md ← ML training integration guide
├── examples/
│ ├── demo-results.tsv ← Sample iteration data
│ └── example-config.json ← Reference template
└── results/ ← Your run data (gitignored)
└── .gitkeep
⚙️ Configuration
AutoForge is configured entirely via environment variables. No config file needed.
| Variable | Default | Description |
|---|---|---|
| `AF_CHANNEL` | `telegram` | Report delivery channel |
| `AF_CHAT_ID` | (none) | Chat/group ID. Unset = stdout |
| `AF_TOPIC_ID` | (none) | Thread/topic ID |
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Only TSV + proposed files. Target unchanged. |
| `--live` | Overwrites target. Auto-backup to `results/backups/`. |
| `--resume` | Continue from existing TSV. |
🤝 Contributing
- Fork → feature branch → PR
- Run `shellcheck scripts/report.sh` and `python3 -m py_compile scripts/visualize.py`
- Include real-world results from your own runs if possible
Good contributions: new eval templates, additional channel support, bug fixes with repro steps, production run case studies.
License
MIT — see LICENSE.
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.