npx skills add akrimm702/autoforge
Or install specific skill: npx add-skill https://github.com/akrimm702/autoforge
# Description
Autonomous iterative optimization for code, skills, prompts, and repositories. Top-agent orchestrates with mathematical convergence. Four modes: prompt (mental simulation), code (real execution + tests), audit (CLI testing + doc fixing), project (whole-repo optimization). Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".
# SKILL.md
name: autoforge
description: 'Autonomous iterative optimization for code, skills, prompts, and repositories. Top-agent orchestrates with mathematical convergence. Four modes: prompt (mental simulation), code (real execution + tests), audit (CLI testing + doc fixing), project (whole-repo optimization). Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".'
AutoForge – Top-Agent Architecture
Overview

```
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
```

Sub-Agent = You
"Sub-Agent" is a conceptual role, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION, not what you send to another agent.
For code execution (mode: code), use exec directly.
Multi-Model Setup (recommended for Deep Audits)
For complex audits, you can split two roles across different models:
| Role | Model | Task |
|---|---|---|
| Optimizer | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| Validator | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |
Flow: Optimizer and Validator alternate. Optimizer iterations have status improved/retained/discard. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with sessions_spawn and explicit model.
When to use Multi-Model: Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.
When Single-Model suffices: Simple CLI audits, prompt optimization, code with clear tests.
Configuration
AutoForge uses environment variables for reporting. All are optional – without them, output goes to stdout.
| Variable | Default | Description |
|---|---|---|
| `AF_CHANNEL` | `telegram` | Messaging channel for reports |
| `AF_CHAT_ID` | (none) | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | (none) | Thread/topic ID within the chat |
Hard Invariants
These rules apply always, regardless of mode:
- TSV is mandatory. Every iteration writes exactly one row to `results/[target]-results.tsv`.
- Reporting is mandatory. Call `report.sh` immediately after every TSV row.
- `--dry-run` never overwrites the target. Only TSV, `*-proposed.md`, and reports are written.
- Mode isolation is strict. Only execute steps for the assigned mode.
- Iteration 1 = Baseline. Evaluate the original version unchanged, status `baseline`.
Modes – Read ONLY Your Mode!
You are assigned ONE mode. Ignore all sections for other modes.
| Mode | What happens | Output |
|---|---|---|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Execute code in sandbox, measure tests | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |
Your mode is in the task prompt. Everything else is irrelevant to you.
TSV Format (same for ALL modes)
Header (once at loop start):

```bash
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
```

Row per iteration:

```bash
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
```
Use `printf`, not `echo -e`! `echo -e` interprets backslashes in field values; `printf '%s'` outputs strings literally.
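A quick bash demo of why this matters (illustrative only, not part of the skill's scripts):

```bash
# A field value containing a literal backslash sequence.
field='change_description with \t in it'

echo -e "$field"         # echo -e expands \t into a real tab, corrupting the TSV column
printf '%s\n' "$field"   # printf '%s' prints the value exactly as stored
```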
5 columns, TAB-separated, EXACTLY this order:
| # | Column | Type | Rules |
|---|---|---|---|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + %: 58%, 92%, 100%. Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: baseline · improved · retained · discard |
Escaping rules (see the sketch after this list):
- Tabs in text fields → replace with spaces
- Newlines in text fields → replace with `|`
- Empty fields → use hyphen `-` (never leave empty)
- `$` and backticks → use `printf '%s'` or escape with `\$` (shell expansion risk!)
- Unicode/Emoji allowed, count as 1 character (not bytes)
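A minimal bash sketch of these escaping rules; `sanitize_field` is a hypothetical helper, not something shipped in `scripts/`:

```bash
# Hypothetical helper that applies the escaping rules to one TSV field.
sanitize_field() {
  local s="$1"
  s="${s//$'\t'/ }"       # tabs -> spaces
  s="${s//$'\n'/|}"       # newlines -> |
  [ -z "$s" ] && s="-"    # empty field -> hyphen
  printf '%s' "$s"
}

# $raw_description holds whatever free text you want to log for this iteration.
desc=$(sanitize_field "$raw_description")
printf '%s\t%s\t%s\t%s\t%s\n' "2" "Flag cleanup" "72%" "$desc" "improved" >> results/[target]-results.tsv
```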
Status rules (based on pass-rate comparison):
- `baseline` – Mandatory for Iteration 1. Evaluate the original version only.
- `improved` – Pass rate higher than previous best → new version becomes current state
- `retained` – Pass rate equal or marginally better → predecessor remains
- `discard` – Pass rate lower → change discarded, revert to best state
Reporting (same for ALL modes)
After EVERY TSV row (including baseline):
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
After loop ends, additionally with --final:
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
The report script reads AF_CHANNEL, AF_CHAT_ID, and AF_TOPIC_ID from environment. Without them, it prints to stdout with ANSI colors.
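For example, two invocations using only the arguments and flags documented above (chat and topic IDs are placeholders):

```bash
# Without env vars: the report is printed to stdout with ANSI colors.
bash scripts/report.sh results/mytool-results.tsv "My Tool"

# With env vars: the same report goes to the configured channel/chat/thread.
AF_CHANNEL=telegram AF_CHAT_ID="-100XXXXXXXXXX" AF_TOPIC_ID="1234" \
  bash scripts/report.sh results/mytool-results.tsv "My Tool" --final
```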
Stop Conditions (for ALL modes)
Priority – first matching condition wins, top to bottom:
1. Minimum iterations – If specified in the task (e.g. "min 5"), this count MUST be reached. No other condition can stop the loop before it.
2. Max 30 iterations – Hard safety net, stop immediately.
3. 3× `discard` in a row – structural problem, stop + analyze.
4. 3× 100% pass rate (after minimum) – confirmed perfect, done.
5. 5× `retained` in a row – converged, done.
Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = consecutive (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.
At 100% in early iterations: Keep going! Test harder edge cases. Only 3× 100% after the minimum confirms true perfection. (A sketch of the stop-condition check follows.)
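One way to check these conditions mechanically against the TSV – a bash/awk sketch under the rules above; the `min` value stands for a task-specified minimum and is an assumption:

```bash
tsv=results/[target]-results.tsv
min=5                                   # assumption: the task said "min 5"

iters=$(( $(wc -l < "$tsv") - 1 ))      # data rows = lines minus header
hundreds=$(awk -F'\t' 'NR>1 && $3=="100%"' "$tsv" | wc -l)                        # need not be consecutive
retained_streak=$(awk -F'\t' 'NR>1{n=($5=="retained")?n+1:0} END{print n+0}' "$tsv")
discard_streak=$(awk -F'\t' 'NR>1{n=($5=="discard")?n+1:0} END{print n+0}' "$tsv")

if   [ "$iters" -lt "$min" ] && [ "$iters" -lt 30 ]; then echo "continue: minimum not reached"
elif [ "$iters" -ge 30 ];          then echo "stop: max 30 iterations"
elif [ "$discard_streak" -ge 3 ];  then echo "stop: 3x discard in a row"
elif [ "$hundreds" -ge 3 ];        then echo "stop: 3x 100% pass rate"
elif [ "$retained_streak" -ge 5 ]; then echo "stop: 5x retained in a row"
else                                    echo "continue"
fi
```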
Recognizing Validator Noise
In multi-model setups, the Validator can produce false positives – fails that aren't real issues:
- Config path vs. tool name confusion (e.g. `agents.list[]` vs. the `agents_list` tool)
- Inverted checks ("no X" → Validator looks for X as required)
- Normal English flagged as a forbidden reference (e.g. "runtime outcome" → `runtime: "acp"`)
- Overcounting (thread commands counted as subagent commands)
Rule: If, after all real fixes, more than 3 discards come in a row and the fail justifications don't hold up under scrutiny → declare convergence, don't validate endlessly.
Execution Modes
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |
mode: prompt
Only read this section if your task contains `mode: prompt`!
Per Iteration: What you do
- Read current prompt/skill
- Mentally simulate 5 different realistic scenarios
- Evaluate each scenario against all evals (Yes=1, No=0)
- Pass rate = (Sum Yes) / (Eval count × 5 scenarios) × 100 (see the sketch after this list)
- Compare with best previous pass rate → determine status
- On `improved`: propose a minimal, surgical improvement
- Write TSV row + call report.sh
- Check stop conditions
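A minimal arithmetic sketch of the pass-rate step, with assumed numbers (4 evals, 5 scenarios, 13 Yes answers):

```bash
yes=13; evals=4; scenarios=5
pass_rate=$(( yes * 100 / (evals * scenarios) ))   # integer percent, as the TSV requires
echo "${pass_rate}%"                               # -> 65%
```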
At the End
Best version → `results/[target]-proposed.md` + `report.sh --final`
mode: code
Only read this section if your task contains `mode: code`!
Per Iteration: What you do
- Create sandbox: `SCRATCH=$(mktemp -d) && cd $SCRATCH`
- Write current code to the sandbox
- Execute the test command (with `timeout 60s`)
- Measure: exit_code, stdout, stderr, runtime (see the sketch after this list)
- Evaluate against evals → calculate pass rate
- On `improved`: minimal code improvement + verify again
- Write TSV row + call report.sh
- Check stop conditions
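A bash sketch of one measurement pass. The target file, the test command (taken from the backup.sh example later in this document), and the GNU `date`-based timing are assumptions to keep the example concrete:

```bash
SCRATCH=$(mktemp -d) && cd "$SCRATCH"
cp /path/to/backup.sh .                          # current candidate code

start=$(date +%s%N)
timeout 60s bash backup.sh personal --dry-run > stdout.log 2> stderr.log
exit_code=$?
runtime_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))

echo "exit_code=$exit_code runtime=${runtime_ms}ms"
grep -q "SUCCESS" stdout.log && echo "output_contains: pass" || echo "output_contains: fail"
[ -s stderr.log ] && echo "no_stderr: fail" || echo "no_stderr: pass"
```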
Code Eval Types
| Eval Type | Description | Example |
|---|---|---|
| `exit_code` | Process exit code | `exit_code == 0` |
| `output_contains` | stdout contains string | `"SUCCESS" in stdout` |
| `output_matches` | stdout matches regex | `r"Total: \d+"` |
| `test_pass` | Test framework green | pytest exit 0 |
| `runtime` | Runtime limit | < 5000ms |
| `no_stderr` | No error output | `stderr == ""` |
| `file_exists` | Output file created | `result.json` exists |
| `json_valid` | Output is valid JSON | `json.loads(stdout)` |
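A few of these eval types checked in bash against the sandbox run sketched above (assumes `python3` is available for the JSON check):

```bash
grep -Eq 'Total: [0-9]+' stdout.log && echo "output_matches: pass"            # regex eval on stdout
[ -f result.json ] && echo "file_exists: pass"                                # expected output file
python3 -m json.tool < stdout.log > /dev/null 2>&1 && echo "json_valid: pass" # stdout parses as JSON
[ "$runtime_ms" -lt 5000 ] && echo "runtime: pass"                            # runtime limit from the table
```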
At the End
Best code → `results/[target]-proposed.[ext]` + `report.sh --final`
mode: audit
Only read this section if your task contains `mode: audit`!
⚠️ DO NOT write or execute your own code. Only test CLI commands of the target tool (--help + read-only).
Two Variants
Simple Audit (CLI skill, clear commands):
- 2 iterations: Baseline → Proposed Fix
- For tools with clear --help output and simple command structure
Deep Audit (complex docs, many checks):
- Iterative loop like prompt/code, same stop conditions
- For extensive documentation with many checkpoints (e.g. config keys, tool policy, parameter lists)
- Recommended: Multi-Model setup (Opus Optimizer + external Validator)
Simple Audit Flow
- Write TSV header
- Iteration 1 (Baseline): Test every documented command → pass rate → TSV + report (a baseline sketch follows this list)
- Iteration 2 (Proposed Fix): Write improved SKILL.md → expected pass rate → TSV + report
- Improved SKILL.md → `results/[target]-proposed.md`
- Detail results → `results/[target]-audit-details.md` (NOT in TSV!)
- `report.sh --final`
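A read-only baseline sketch for the iteration 1 step. The tool name and `documented-commands.txt` (one documented subcommand per line, extracted from SKILL.md) are placeholders:

```bash
tool=notebooklm-py
pass=0; total=0
while read -r cmd; do
  total=$((total + 1))
  # Only --help / read-only invocations – never anything that modifies state.
  if "$tool" $cmd --help > /dev/null 2>&1; then
    pass=$((pass + 1))
  fi
done < documented-commands.txt

[ "$total" -gt 0 ] && echo "baseline pass_rate: $(( pass * 100 / total ))%"
```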
Deep Audit Flow
- Write TSV header
- Iteration 1 (Baseline): Extract ground truth from source, define all checks, evaluate baseline
- Iterations 2+: Optimizer fixes issues → Validator checks → TSV + report per iteration
- Loop runs until stop conditions trigger (3× 100%, 5× retained, 3× discard)
- Final version → `results/[target]-proposed.md` or `results/[target]-v1.md`
- `report.sh --final`
Fixed Evals (audit)
- Completeness – Does SKILL.md cover ≥80% of real commands/config?
- Correctness – Are ≥90% of documented commands/params syntactically correct?
- No stale references – Does everything documented actually exist?
- No missing core features – Are all important features covered?
- Workflow quality – Does the quick-start actually work?
mode: project
Only read this section if your task contains `mode: project`!
⚠️ This mode operates on an ENTIRE repository/directory, not a single file. Cross-file consistency is the core feature – this is NOT "audit on many files."
Three Phases
Project mode runs through three sequential phases. Phases 1 and 2 happen once (in Iteration 1 = Baseline). Phase 3 is the iterative fix loop.
Phase 1: Scan & Plan
- Analyze the repo directory:

  ```bash
  # Discover structure
  tree -L 3 --dirsfirst [target_dir]
  ls -la [target_dir]
  ```

- Identify relevant files and classify by priority:
| Priority | Files |
|---|---|
| critical | README, Dockerfile, CI workflows (.github/workflows), package.json/requirements.txt, main entry points |
| normal | Tests, configs, scripts, .env.example, .gitignore |
| low | Docs, examples, LICENSE, CHANGELOG |
- Build the File-Map – a mental inventory of what exists and what's missing.
- Compose eval set: Merge user-provided evals with auto-detected evals (see Default Evals below).
Phase 2: Cross-File Analysis
Run consistency checks across files. Each check = one eval point:
| Check | What it verifies |
|---|---|
| README ↔ CLI | Documented commands/flags match actual --help output |
| Dockerfile ↔ deps | requirements.txt / package.json versions match what the Dockerfile installs |
| CI ↔ project structure | Workflow references correct paths, scripts, test commands |
| `.env.example` ↔ code | Every env var in code has a corresponding entry in .env.example |
| Imports ↔ dependencies | Every import / require has a matching dependency declaration |
| Tests ↔ source | Test files exist for critical modules |
| `.gitignore` ↔ artifacts | Build outputs, secrets, and caches are excluded |
Result of Phase 2: A complete eval checklist with per-file and cross-file checks, each scored Yes/No.
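A sketch of one such check (the `.env.example` ↔ code row). The grep pattern is a simplification that only catches shell-style `$VAR` / `${VAR}` references:

```bash
# Collect env var names referenced in shell scripts.
used=$(grep -rhoE '\$[{]?[A-Z][A-Z0-9_]+' --include='*.sh' . 2>/dev/null | tr -d '${' | sort -u)

missing=0
for var in $used; do
  grep -q "^${var}=" .env.example || { echo "missing in .env.example: $var"; missing=$((missing + 1)); }
done
[ "$missing" -eq 0 ] && echo ".env.example check: pass" || echo ".env.example check: fail ($missing missing)"
```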
Phase 3: Iterative Fix Loop
Same loop logic as prompt/code/audit – TSV, report.sh, stop conditions. Key differences:
- Multiple files can be changed per iteration
- Pass rate = aggregated over ALL evals (file-specific + cross-file)
- Fixes are minimal and surgical – don't refactor blindly, only fix what improves pass rate
- `change_description` includes which files were touched: "Fix Dockerfile + CI workflow sync"
Per Iteration: What you do
- Evaluate current repo state against all evals (file-specific + cross-file)
- Calculate pass rate: (passing evals / total evals) × 100
- Compare with best previous pass rate → determine status
- On `improved`: apply minimal, surgical fixes to the fewest files necessary
- Verify the fix didn't break other evals (re-run affected checks)
- Write TSV row + call report.sh
- Check stop conditions
Dry-Run vs Live
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Fixed files → `results/[target]-proposed/` directory (mirrors repo structure). Original repo untouched. |
| `--live` | Files overwritten in-place. Originals backed up → `results/backups/` (preserving directory structure). |
Default Evals (auto-applied unless overridden)
These evals are automatically used when the user doesn't provide custom evals. The agent detects which are applicable based on what exists in the repo:
| # | Eval | Condition |
|---|---|---|
| 1 | README accurate? (describes actual features/commands) | README exists |
| 2 | Tests present and green? (pytest / npm test / go test) | Test files or test config detected |
| 3 | CI configured and syntactically correct? | .github/workflows/ or .gitlab-ci.yml exists |
| 4 | No hardcoded secrets? (`grep -rE "(password\|api_key\|token\|secret)\s*="`) | Always |
| 5 | Dependencies complete? (requirements.txt ↔ imports, package.json ↔ requires) | Dependency file exists |
| 6 | Dockerfile functional? (docker build succeeds or Dockerfile syntax valid) | Dockerfile exists |
| 7 | .gitignore sensible? (no secrets, build artifacts excluded) | .gitignore exists |
| 8 | License present? | Always |
Eval Scoring
Pass Rate = (Passing Evals / Total Applicable Evals) × 100
Evals that don't apply (e.g. "Dockerfile functional?" when no Dockerfile exists) are excluded from the total, not counted as passes.
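A worked example of this rule, with assumed numbers:

```bash
# 8 default evals, but the repo has no Dockerfile -> eval 6 is not applicable.
applicable=7
passing=5
echo "pass_rate: $(( passing * 100 / applicable ))%"   # -> 71%, not 5/8 = 62%
```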
At the End
- `--dry-run`: All proposed changes → `results/[target]-proposed/` directory
- `--live`: Changes already applied, backups in `results/backups/`
- `report.sh --final`
- Optionally: `results/[target]-project-details.md` with per-file findings (NOT in TSV!)
Directory Structure
```
autoforge/
├── SKILL.md                        ← This file
├── results/
│   ├── [target]-results.tsv        ← TSV logs
│   ├── [target]-proposed.md        ← Proposed improvement (prompt/audit)
│   ├── [target]-proposed/          ← Proposed repo changes (project mode)
│   │   ├── README.md
│   │   ├── Dockerfile
│   │   └── ...
│   ├── [target]-v1.md              ← Deep audit final version
│   ├── [target]-audit-details.md   ← Audit details (audit mode only)
│   ├── [target]-project-details.md ← Project details (project mode only)
│   └── backups/                    ← Auto-backups (--live)
│       ├── [file].bak              ← Single file backups (prompt/code/audit)
│       └── [target]-backup/        ← Full directory backup (project mode)
├── scripts/
│   ├── report.sh                   ← Channel reporting
│   └── visualize.py                ← PNG chart (optional)
├── references/
│   ├── eval-examples.md            ← Pre-built evals
│   └── ml-mode.md                  ← ML training guide
└── examples/
    ├── demo-results.tsv            ← Demo data
    └── example-config.json         ← Example configuration
```
Examples (task descriptions, NOT CLI commands)
AutoForge is not a CLI tool – it's a skill prompt for the agent:
```
# Optimize a prompt
"Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY correct? Workspace protected? Clearly structured?"

# Audit a CLI skill (simple)
"Start autoforge mode: audit for notebooklm-py."

# Deep audit with multi-model
"Start autoforge mode: audit (deep) for subagents docs.
Optimizer: Opus, Validator: GPT-5
Extract ground truth from source, validate iteratively."

# Optimize code
"Start autoforge mode: code for backup.sh.
File: ./backup.sh
Test: bash backup.sh personal --dry-run
Evals: exit_code==0, backup file created, < 10s runtime"

# Optimize a whole repository
"Start autoforge mode: project for ./my-app
Evals: Tests green? CI correct? No hardcoded secrets? README accurate?"

# Project mode with custom focus
"Start autoforge mode: project for /path/to/api-server
Focus: Docker + CI pipeline consistency
Evals: docker build succeeds, CI workflow references correct paths,
.env.example covers all env vars used in code"

# Project mode dry-run (default)
"Start autoforge mode: project for ./my-tool --dry-run
Use default evals. Show me what needs fixing."
```
Eval Examples → Mode Mapping
references/eval-examples.md provides ready-to-use Yes/No evals grouped by category. Here's how they map to AutoForge modes:
| eval-examples.md Category | AutoForge Mode | Notes |
|---|---|---|
| Briefing, Email, Calendar, Summary, Proposal | `prompt` | Mental simulation with scenario evals |
| Python Script, Shell Script, API, Data Pipeline, Build | `code` | Real execution with measurable criteria |
| CI/CD, Docker, Helm, Kubernetes, Terraform | `code` or `project` | `code` for single files, `project` for cross-file |
| Code Review, API Documentation | `audit` | Verify docs match reality |
| Project / Repository, Cross-File Consistency, Security Baseline | `project` | Whole-repo scanning and cross-file checks |
Pick evals from the matching category and paste them into your task prompt as the eval set.
Tips
- Always start with `--dry-run`
- `prompt` = think, `code` = execute, `audit` = test CLI, `project` = optimize repo
- Simple Audit for clear CLI skills, Deep Audit for complex docs
- Project mode scans the whole repo – cross-file consistency is the killer feature
- Multi-Model for Deep Audits: different models cover different blind spots
- At >3 discards after all fixes: check for validator noise, declare convergence if justified
- TSV + report.sh are NOT optional – they are the user interface
- For ML training: see `references/ml-mode.md`
# README.md
Most "self-improving agent" approaches boil down to "reflect on your output."
That's a vibe check, not optimization. AutoForge is different.
| | Typical "Reflect" | AutoForge |
|---|---|---|
| When to stop | "Looks good to me" | 3× 100% pass, 5× retained, or 3× discard |
| Progress | Chat history | TSV with pass rates & diffs per iteration |
| Validation | Same model checks itself | Multi-model cross-validation |
| Reporting | Final summary | Live Unicode bars after every iteration |
| Modes | One generic loop | 4 specialized modes |
| Track record | Demo | 50+ iterations across 6 production skills |
Architecture
Core Loop
```
Define target + evals
          │
          ▼
    Baseline scan
          │
          ▼
 ┌──▶ Optimizer (Claude Opus) ── proposed ──▶ Validator (GPT-5)
 │                                                  │
 │                                    improved ◀────┴────▶ discarded
 │                                        │                    │
 │                                        └─────────┬──────────┘
 │                                                  ▼
 │                                      Log to TSV + report.sh
 │                                                  │
 │                                                  ▼
 └──────────────── No ◀───────────────────────  Converged?
                                                    │
                                ┌───────────────────┼───────────────────┐
                                ▼                   ▼                   ▼
                            3× 100%            5× retained          3× discard
                            Deploy!            Converged            Stop ⚠️
```
Multi-Model Validation
```
Iter 1 (Optimizer)      Iter 2 (Validator)      Iter 3 (Optimizer)
┌────────────────┐      ┌────────────────┐      ┌────────────────┐
│ Claude Opus    │      │ GPT-5          │      │ Claude Opus    │
│                │      │                │      │                │
│ Analyze        │      │ Blind review   │      │ Fix findings   │
│ Find issues    │      │ of output      │      │ from GPT-5     │
│ Write fixes    │      │ (no context)   │      │                │
└───────┬────────┘      └───────┬────────┘      └───────┬────────┘
        ▼                       ▼                       ▼
 pass_rate: 62%          pass_rate: 78%          pass_rate: 95%
 status: improved        status: improved        status: improved
 ────── TSV ──────       ────── TSV ──────       ────── TSV ──────
```
Different model validates – no "grading your own homework" blind spot.
Project Mode – Three Phases
```
Phase 1: SCAN & PLAN
  Walk repo tree → build file priority map
        │
        ▼
Phase 2: CROSS-FILE ANALYSIS
  README ↔ CLI · Dockerfile ↔ deps · CI ↔ scripts
  .env ↔ code refs · imports ↔ reqs · .gitignore ↔ outputs
        │
        ▼
Phase 3: ITERATIVE FIX LOOP
  ┌──▶ Surgical fix across files
  │          │
  │          ▼
  │   Validate consistency
  │          │
  └── not done yet ◀──┘
```
Quick Start
Install
```bash
# Via ClawHub
clawhub install autoforge

# Or clone
git clone https://github.com/akrimm702/autoforge.git
cp -r autoforge ~/.openclaw/workspace/skills/autoforge
```
Configure reporting (optional)
```bash
export AF_CHANNEL="telegram"        # telegram | discord | slack
export AF_CHAT_ID="-100XXXXXXXXXX"  # chat/group ID
export AF_TOPIC_ID="1234"           # thread ID (optional)
```
No env vars? Reports print to stdout with ANSI colors.
Tell your agent
```
Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY handling correct? Workspace protection enforced? Clear structure?
```
The agent reads the skill, runs the loop, tracks everything in TSV, reports live, and stops when convergence math says it's done.
Four Modes
prompt – Mental Simulation
Simulates 5 realistic scenarios per iteration, evaluates Yes/No against defined evals, calculates pass rate mathematically. No code execution.
Best for: SKILL.md files, prompt engineering, documentation, briefing templates.
code – Real Execution
Runs code in a sandbox, measures exit codes, stdout, stderr, runtime. Evaluates against concrete test criteria.
Best for: Shell scripts, Python tools, data pipelines, build systems.
audit – CLI Testing
Tests documented commands against actual CLI behavior (--help, read-only). Catches docs-vs-reality drift. Two variants: Simple (2 iterations) or Deep (iterative with multi-model).
Best for: Verifying skill documentation matches real CLI behavior.
project – Whole Repository
Scans an entire repo, builds a file-map with priorities, runs cross-file consistency checks, and iteratively fixes issues across multiple files.
Best for: README ↔ CLI drift, Dockerfile ↔ dependency mismatches, CI ↔ project structure gaps.
Cross-file checks include:
- README documents what the CLI actually does
- Dockerfile installs the right dependency versions
- CI workflows reference correct paths and scripts
- `.env.example` covers all env vars used in code
- Every import has a matching dependency declaration
- `.gitignore` excludes build artifacts and secrets
Live Reporting
After each iteration, report.sh sends live updates:
```
AutoForge: coding-agent

Iter 1   █████████░░░░░░░░░░░   45%
Iter 2   ████████████░░░░░░░░   62%
Iter 3   ████████████████░░░░   78%
Iter 4   █████████████████░░░   85%
Iter 5   ██████████████████░░   90%
Iter 6   ██████████████████░░   92%
Iter 7   ██████████████████░░   92%
Iter 8   ███████████████████░   95%
Iter 9   ███████████████████░   95%
Iter 10  ████████████████████  100%
────────────────────────────────────
Iterations: 10
Keep: 10 · Discard: 0
Best pass rate: 100% (Iter 10)
Loop converged – improvement found
```
Every iteration is tracked in TSV:
```
iteration   prompt_version_summary    pass_rate   change_description            status
1           Baseline                  45%         Original SKILL.md baseline    baseline
2           Add missing subcommands   62%         16 Codex subcommands added    improved
3           Fix approval flags        78%         Scoped flags by context       improved
...
10          Validation pass           100%        All checks green              improved
```
Convergence Rules
No vibes. No "looks good." Mathematical stop conditions:
| Condition | Rule | Purpose |
|---|---|---|
| Minimum iters | Must reach N before any stop | Prevents premature convergence |
| Max 30 iters | Hard safety cap | Cost protection |
| 3× discard streak | Stop + analyze | Detects structural problems |
| 3× 100% pass | Confirmed perfect | After minimum reached |
| 5× retained streak | Fully converged | No further improvement possible |
Validator noise detection: In multi-model setups, validators can produce false positives. AutoForge recognizes config/path confusion, inverted checks, normal English flagged as forbidden references, and over-counting. After all real fixes, if >3 discards stem from non-reproducible complaints → declare convergence.
Multi-Model Cross-Validation
For complex audits, split optimizer and validator across different models:
| Role | Example Models | Task |
|---|---|---|
| Optimizer | Claude Opus, GPT-4.1 | Finds issues, writes fixes |
| Validator | GPT-5, Gemini | Checks against ground truth independently |
The validator doesn't see the optimizer's reasoning β just the output. This prevents the "same model validates its own work" blind spot.
Real-World Results
Production runs, not demos.
coding-agent SKILL.md – 553 lines rewritten across 10 iterations. 16 Codex subcommands + 40 Claude CLI flags documented. 45% → 100%. Discovered `--yolo` was never a real flag.
ACP Router – 90% → 100% in 9 iterations. Agent coverage doubled from 6 to 12 harnesses. Thread spawn recovery policy written from scratch.
Sub-Agents Documentation – 70% → 100% in 14 iterations with multi-model validation. 6 real bugs found in upstream docs. Identified 4 categories of validator false positives.
backup.sh – Added rsync support, validation checks, restore-test. Code mode with real execution. 3 iterations to stable, 2 more to polish.
AutoForge on itself – Self-forged in project mode: 67% → 100% in 8 iterations. Fixed 2 script bugs, cleaned config/doc inconsistencies across 7 files.
Directory Structure
```
autoforge/
├── SKILL.md              ← OpenClaw skill definition
├── README.md             ← You are here
├── LICENSE               ← MIT
├── .gitignore
├── scripts/
│   ├── report.sh         ← Live reporting (channel or stdout)
│   └── visualize.py      ← PNG progress chart generator
├── references/
│   ├── eval-examples.md  ← 200+ pre-built evals by category
│   └── ml-mode.md        ← ML training integration guide
├── examples/
│   ├── demo-results.tsv      ← Sample iteration data
│   └── example-config.json   ← Reference template
└── results/              ← Your run data (gitignored)
    └── .gitkeep
```
Configuration
AutoForge is configured entirely via environment variables. No config file needed.
| Variable | Default | Description |
|---|---|---|
| `AF_CHANNEL` | `telegram` | Report delivery channel |
| `AF_CHAT_ID` | (none) | Chat/group ID. Unset = stdout |
| `AF_TOPIC_ID` | (none) | Thread/topic ID |
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Only TSV + proposed files. Target unchanged. |
| `--live` | Overwrites target. Auto-backup to `results/backups/`. |
| `--resume` | Continue from existing TSV. |
Contributing
- Fork → feature branch → PR
- Run `shellcheck scripts/report.sh` and `python3 -m py_compile scripts/visualize.py`
- Include real-world results from your own runs if possible
Good contributions: new eval templates, additional channel support, bug fixes with repro steps, production run case studies.
License
MIT β see LICENSE.
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.