autoforge

by @akrimm702 in Development

# Install this skill:

npx skills add akrimm702/autoforge

Or install specific skill: npx add-skill https://github.com/akrimm702/autoforge

# Description

Autonomous iterative optimization for code, skills, prompts, and repositories. Top-agent orchestrates with mathematical convergence. Four modes: prompt (mental simulation), code (real execution + tests), audit (CLI testing + doc fixing), project (whole-repo optimization). Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".

# SKILL.md

name: autoforge
description: 'Autonomous iterative optimization for code, skills, prompts, and repositories. Top-agent orchestrates with mathematical convergence. Four modes: prompt (mental simulation), code (real execution + tests), audit (CLI testing + doc fixing), project (whole-repo optimization). Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".'

AutoForge — Top-Agent Architecture

Overview

Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel

Sub-Agent = You

"Sub-Agent" is a conceptual role, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.

For code execution (mode: code), use exec directly.

Multi-Model Setup (recommended for Deep Audits)

For complex audits, you can split two roles across different models:

Role	Model	Task
Optimizer	Opus / GPT-4.1	Analyzes, finds issues, writes fixes
Validator	GPT-5 / Gemini (different model)	Checks against ground truth, provides pass rate

Flow: Optimizer and Validator alternate. Optimizer iterations have status improved/retained/discard. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with sessions_spawn and explicit model.

When to use Multi-Model: Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.

When Single-Model suffices: Simple CLI audits, prompt optimization, code with clear tests.

Configuration

AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.

Variable	Default	Description
`AF_CHANNEL`	`telegram`	Messaging channel for reports
`AF_CHAT_ID`	(none)	Chat/group ID for report delivery
`AF_TOPIC_ID`	(none)	Thread/topic ID within the chat

Hard Invariants

These rules apply always, regardless of mode:

TSV is mandatory. Every iteration writes exactly one row to results/[target]-results.tsv.
Reporting is mandatory. Call report.sh immediately after every TSV row.
--dry-run never overwrites the target. Only TSV, *-proposed.md, and reports are written.
Mode isolation is strict. Only execute steps for the assigned mode.
Iteration 1 = Baseline. Evaluate the original version unchanged, status baseline.

Modes — Read ONLY Your Mode!

You are assigned ONE mode. Ignore all sections for other modes.

Mode	What happens	Output
`prompt`	Mentally simulate skill/prompt, evaluate against evals	Improved prompt text
`code`	Execute code in sandbox, measure tests	Improved code
`audit`	Test CLI commands (read-only only!) + verify SKILL.md against reality	Improved SKILL.md
`project`	Scan whole repo, cross-file analysis, fix multiple files per iteration	Improved repository

Your mode is in the task prompt. Everything else is irrelevant to you.

TSV Format (same for ALL modes)

Header (once at loop start):

printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv

Row per iteration:

printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv

Use printf not echo -e! echo -e interprets backslashes in field values. printf '%s' outputs strings literally.

5 columns, TAB-separated, EXACTLY this order:

#	Column	Type	Rules
1	`iteration`	Integer	1, 2, 3, ...
2	`prompt_version_summary`	String	Max 50 Unicode chars. No tabs, no newlines.
3	`pass_rate`	String	Number + `%`: `58%`, `92%`, `100%`. Always integer.
4	`change_description`	String	Max 100 Unicode chars. No tabs, no newlines.
5	`status`	Enum	Exactly one of: `baseline` · `improved` · `retained` · `discard`

Escaping rules:

Tabs in text fields → replace with spaces
Newlines in text fields → replace with |
Empty fields → use hyphen - (never leave empty)
$ and backticks → use printf '%s' or escape with \$ (shell expansion risk!)
Unicode/Emoji allowed, count as 1 character (not bytes)

Status rules (based on pass-rate comparison):

baseline — Mandatory for Iteration 1. Evaluate original version only.
improved — Pass rate higher than previous best → new version becomes current state
retained — Pass rate equal or marginally better → predecessor remains
discard — Pass rate lower → change discarded, revert to best state

Reporting (same for ALL modes)

After EVERY TSV row (including baseline):

bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"

After loop ends, additionally with --final:

bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final

The report script reads AF_CHANNEL, AF_CHAT_ID, and AF_TOPIC_ID from environment. Without them, it prints to stdout with ANSI colors.

Stop Conditions (for ALL modes)

Priority — first matching condition wins, top to bottom:

🛑 Minimum iterations — If specified in task (e.g. "min 5"), this count MUST be reached. No other condition can stop before.
🛑 Max 30 iterations — Hard safety net, stop immediately.
❌ 3× discard in a row → structural problem, stop + analyze.
✅ 3× 100% pass rate (after minimum) → confirmed perfect, done.
➡️ 5× retained in a row → converged, done.

Counting rules:

3× 100% = three iterations with pass_rate == 100%, not necessarily consecutive.
5× retained and 3× discard = consecutive (in a row).
baseline counts toward no series.
improved interrupts retained and discard series.

At 100% in early iterations: Keep going! Test harder edge cases. Only 3× 100% after the minimum confirms true perfection.

Recognizing Validator Noise

In multi-model setups, the Validator can produce false positives — fails that aren't real issues:

Config path vs tool name confusion (e.g. agents.list[] ≠ agents_list tool)
Inverted checks ("no X" → Validator looks for X as required)
Normal English as forbidden reference (e.g. "runtime outcome" ≠ runtime: "acp")
Overcounting (thread commands counted as subagent commands)

Rule: If after all real fixes >3 discards come in a row and the fail justifications don't hold up under scrutiny → declare convergence, don't validate endlessly.

Execution Modes

Flag	Behavior
`--dry-run` (default)	Only TSV + proposed files. Target file/repo remains unchanged.
`--live`	Target file/repo is overwritten. Auto-backup → `results/backups/`
`--resume`	Read existing TSV, continue from last iteration. On invalid format: abort.

mode: prompt

Only read if your task contains mode: prompt!

Per Iteration: What you do

Read current prompt/skill
Mentally simulate 5 different realistic scenarios
Evaluate each scenario against all evals (Yes=1, No=0)
Pass rate = (Sum Yes) / (Eval count × 5 scenarios) × 100
Compare with best previous pass rate → determine status
On improved: propose minimal, surgical improvement
Write TSV row + call report.sh
Check stop conditions

At the End

Best version → results/[target]-proposed.md + report.sh --final

mode: code

Only read if your task contains mode: code!

Per Iteration: What you do

Create sandbox: SCRATCH=$(mktemp -d) && cd $SCRATCH
Write current code to sandbox
Execute test command (with timeout 60s)
Measure: exit_code, stdout, stderr, runtime
Evaluate against evals → calculate pass rate
On improved: minimal code improvement + verify again
Write TSV row + call report.sh
Check stop conditions

Code Eval Types

Eval Type	Description	Example
`exit_code`	Process exit code	`exit_code == 0`
`output_contains`	stdout contains string	`"SUCCESS" in stdout`
`output_matches`	stdout matches regex	`r"Total: \d+"`
`test_pass`	Test framework green	`pytest exit 0`
`runtime`	Runtime limit	`< 5000ms`
`no_stderr`	No error output	`stderr == ""`
`file_exists`	Output file created	`result.json exists`
`json_valid`	Output is valid JSON	`json.loads(stdout)`

At the End

Best code → results/[target]-proposed.[ext] + report.sh --final

mode: audit

Only read if your task contains mode: audit!

⚠️ DO NOT write or execute your own code. Only test CLI commands of the target tool (--help + read-only).

Two Variants

Simple Audit (CLI skill, clear commands):
- 2 iterations: Baseline → Proposed Fix
- For tools with clear --help output and simple command structure

Deep Audit (complex docs, many checks):
- Iterative loop like prompt/code, same stop conditions
- For extensive documentation with many checkpoints (e.g. config keys, tool policy, parameter lists)
- Recommended: Multi-Model setup (Opus Optimizer + external Validator)

Simple Audit Flow

Write TSV header
Iteration 1 (Baseline): Test every documented command → pass rate → TSV + report
Iteration 2 (Proposed Fix): Write improved SKILL.md → expected pass rate → TSV + report
Improved SKILL.md → results/[target]-proposed.md
Detail results → results/[target]-audit-details.md (NOT in TSV!)
report.sh --final

Deep Audit Flow

Write TSV header
Iteration 1 (Baseline): Extract ground truth from source, define all checks, evaluate baseline
Iterations 2+: Optimizer fixes issues → Validator checks → TSV + report per iteration
Loop runs until stop conditions trigger (3× 100%, 5× retained, 3× discard)
Final version → results/[target]-proposed.md or results/[target]-v1.md
report.sh --final

Fixed Evals (audit)

Completeness — Does SKILL.md cover ≥80% of real commands/config?
Correctness — Are ≥90% of documented commands/params syntactically correct?
No stale references — Does everything documented actually exist?
No missing core features — Are all important features covered?
Workflow quality — Does quick-start actually work?

mode: project

Only read if your task contains mode: project!

⚠️ This mode operates on an ENTIRE repository/directory, not a single file. Cross-file consistency is the core feature — this is NOT "audit on many files."

Three Phases

Project mode runs through three sequential phases. Phases 1 and 2 happen once (in Iteration 1 = Baseline). Phase 3 is the iterative fix loop.

Phase 1: Scan & Plan

Analyze the repo directory:
bash # Discover structure tree -L 3 --dirsfirst [target_dir] ls -la [target_dir]
Identify relevant files and classify by priority:

Priority	Files
critical	README, Dockerfile, CI workflows (.github/workflows), package.json/requirements.txt, main entry points
normal	Tests, configs, scripts, .env.example, .gitignore
low	Docs, examples, LICENSE, CHANGELOG

Build the File-Map — a mental inventory of what exists and what's missing.
Compose eval set: Merge user-provided evals with auto-detected evals (see Default Evals below).

Phase 2: Cross-File Analysis

Run consistency checks across files. Each check = one eval point:

Check	What it verifies
README ↔ CLI	Documented commands/flags match actual `--help` output
Dockerfile ↔ deps	`requirements.txt` / `package.json` versions match what Dockerfile installs
CI ↔ project structure	Workflow references correct paths, scripts, test commands
`.env.example` ↔ code	Every env var in code has a corresponding entry in `.env.example`
Imports ↔ dependencies	Every `import` / `require` has a matching dependency declaration
Tests ↔ source	Test files exist for critical modules
`.gitignore` ↔ artifacts	Build outputs, secrets, and caches are excluded

Result of Phase 2: A complete eval checklist with per-file and cross-file checks, each scored Yes/No.

Phase 3: Iterative Fix Loop

Same loop logic as prompt/code/audit — TSV, report.sh, stop conditions. Key differences:

Multiple files can be changed per iteration
Pass rate = aggregated over ALL evals (file-specific + cross-file)
Fixes are minimal and surgical — don't refactor blindly, only fix what improves pass rate
change_description includes which files were touched: "Fix Dockerfile + CI workflow sync"

Per Iteration: What you do

Evaluate current repo state against all evals (file-specific + cross-file)
Calculate pass rate: (passing evals / total evals) × 100
Compare with best previous pass rate → determine status
On improved: apply minimal, surgical fixes to the fewest files necessary
Verify the fix didn't break other evals (re-run affected checks)
Write TSV row + call report.sh
Check stop conditions

Dry-Run vs Live

Flag	Behavior
`--dry-run` (default)	Fixed files → `results/[target]-proposed/` directory (mirrors repo structure). Original repo untouched.
`--live`	Files overwritten in-place. Originals backed up → `results/backups/` (preserving directory structure).

Default Evals (auto-applied unless overridden)

These evals are automatically used when the user doesn't provide custom evals. The agent detects which are applicable based on what exists in the repo:

#	Eval	Condition
1	README accurate? (describes actual features/commands)	README exists
2	Tests present and green? (`pytest` / `npm test` / `go test`)	Test files or test config detected
3	CI configured and syntactically correct?	`.github/workflows/` or `.gitlab-ci.yml` exists
4	No hardcoded secrets? (`grep -rE "(password\|api_key\|token\|secret)\s*="`)	Always
5	Dependencies complete? (`requirements.txt` ↔ imports, `package.json` ↔ requires)	Dependency file exists
6	Dockerfile functional? (`docker build` succeeds or Dockerfile syntax valid)	Dockerfile exists
7	`.gitignore` sensible? (no secrets, build artifacts excluded)	`.gitignore` exists
8	License present?	Always

Eval Scoring

Pass Rate = (Passing Evals / Total Applicable Evals) × 100

Evals that don't apply (e.g. "Dockerfile functional?" when no Dockerfile exists) are excluded from the total, not counted as passes.

At the End

--dry-run: All proposed changes → results/[target]-proposed/ directory
--live: Changes already applied, backups in results/backups/
report.sh --final
Optionally: results/[target]-project-details.md with per-file findings (NOT in TSV!)

Directory Structure

autoforge/
├── SKILL.md                    ← This file
├── results/
│   ├── [target]-results.tsv    ← TSV logs
│   ├── [target]-proposed.md    ← Proposed improvement (prompt/audit)
│   ├── [target]-proposed/      ← Proposed repo changes (project mode)
│   │   ├── README.md
│   │   ├── Dockerfile
│   │   └── ...
│   ├── [target]-v1.md          ← Deep audit final version
│   ├── [target]-audit-details.md ← Audit details (audit mode only)
│   ├── [target]-project-details.md ← Project details (project mode only)
│   └── backups/                ← Auto-backups (--live)
│       ├── [file].bak          ← Single file backups (prompt/code/audit)
│       └── [target]-backup/    ← Full directory backup (project mode)
├── scripts/
│   ├── report.sh               ← Channel reporting
│   └── visualize.py            ← PNG chart (optional)
├── references/
│   ├── eval-examples.md        ← Pre-built evals
│   └── ml-mode.md              ← ML training guide
└── examples/
    ├── demo-results.tsv        ← Demo data
    └── example-config.json     ← Example configuration

Examples (task descriptions, NOT CLI commands)

AutoForge is not a CLI tool — it's a skill prompt for the agent:

# Optimize a prompt
"Start autoforge mode: prompt for the coding-agent skill.
 Evals: PTY correct? Workspace protected? Clearly structured?"

# Audit a CLI skill (simple)
"Start autoforge mode: audit for notebooklm-py."

# Deep audit with multi-model
"Start autoforge mode: audit (deep) for subagents docs.
 Optimizer: Opus, Validator: GPT-5
 Extract ground truth from source, validate iteratively."

# Optimize code
"Start autoforge mode: code for backup.sh.
 File: ./backup.sh
 Test: bash backup.sh personal --dry-run
 Evals: exit_code==0, backup file created, < 10s runtime"

# Optimize a whole repository
"Start autoforge mode: project for ./my-app
 Evals: Tests green? CI correct? No hardcoded secrets? README accurate?"

# Project mode with custom focus
"Start autoforge mode: project for /path/to/api-server
 Focus: Docker + CI pipeline consistency
 Evals: docker build succeeds, CI workflow references correct paths,
        .env.example covers all env vars used in code"

# Project mode dry-run (default)
"Start autoforge mode: project for ./my-tool --dry-run
 Use default evals. Show me what needs fixing."

Eval Examples → Mode Mapping

references/eval-examples.md provides ready-to-use Yes/No evals grouped by category. Here's how they map to AutoForge modes:

eval-examples.md Category	AutoForge Mode	Notes
Briefing, Email, Calendar, Summary, Proposal	`prompt`	Mental simulation with scenario evals
Python Script, Shell Script, API, Data Pipeline, Build	`code`	Real execution with measurable criteria
CI/CD, Docker, Helm, Kubernetes, Terraform	`code` or `project`	`code` for single files, `project` for cross-file
Code Review, API Documentation	`audit`	Verify docs match reality
Project / Repository, Cross-File Consistency, Security Baseline	`project`	Whole-repo scanning and cross-file checks

Pick evals from the matching category and paste them into your task prompt as the eval set.

Tips

Always start with --dry-run
prompt = think, code = execute, audit = test CLI, project = optimize repo
Simple Audit for clear CLI skills, Deep Audit for complex docs
Project mode scans the whole repo — cross-file consistency is the killer feature
Multi-Model for Deep Audits: different models cover different blind spots
At >3 discards after all fixes: check for validator noise, declare convergence if justified
TSV + report.sh are NOT optional — they are the user interface
For ML training: see references/ml-mode.md

# README.md

# 🔨 AutoForge ### *Stop vibing. Start converging.* **Autonomous optimization loops for AI agent skills, code, docs & entire repos.** Mathematical convergence · Multi-model cross-validation · Live Unicode reporting [![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE) [![OpenClaw Skill](https://img.shields.io/badge/OpenClaw-skill-brightgreen.svg)](https://openclaw.ai) [![ClawHub](https://img.shields.io/badge/ClawHub-autoforge-8B5CF6.svg)](https://clawhub.com/skills/autoforge)

Most "self-improving agent" approaches boil down to "reflect on your output."
That's a vibe check, not optimization. AutoForge is different.

	Typical "Reflect"	AutoForge
When to stop	"Looks good to me"	3× 100% pass, 5× retained, or 3× discard
Progress	Chat history	TSV with pass rates & diffs per iteration
Validation	Same model checks itself	Multi-model cross-validation
Reporting	Final summary	Live Unicode bars after every iteration
Modes	One generic loop	4 specialized modes
Track record	Demo	50+ iterations across 6 production skills

🏗️ Architecture

Core Loop

                         ┌─────────────────────┐
                         │  Define target +     │
                         │  evals               │
                         └─────────┬───────────┘
                                   │
                                   ▼
                         ┌─────────────────────┐
                         │  Baseline scan       │
                         └─────────┬───────────┘
                                   │
                    ┌──────────────┘
                    │
                    ▼
           ┌───────────────┐     proposed     ┌───────────────┐
       ┌──▶│   Optimizer   │────────────────▶│   Validator   │
       │   │  (Claude Opus)│                  │    (GPT-5)    │
       │   └───────────────┘                  └───────┬───────┘
       │                                              │
       │                              ┌───────────────┴───────────────┐
       │                              │                               │
       │                              ▼                               ▼
       │                     ┌─────────────┐                 ┌─────────────┐
       │                     │ ✅ improved  │                 │ ❌ discarded │
       │                     └──────┬──────┘                 └──────┬──────┘
       │                            │                               │
       │                            └───────────┬─────────────┬─────┘
       │                                        ▼             │
       │                               ┌─────────────┐       │
       │                               │  Log to TSV  │       │
       │                               │  + report.sh │       │
       │                               └──────┬──────┘       │
       │                                      │              │
       │                                      ▼              │
       │                              ┌──────────────┐       │
       │          No                  │  Converged?  │       │
       └──────────────────────────────┤              │       │
                                      └──────┬───────┘       │
                                             │               │
                         ┌───────────────────┼───────────────┐
                         │                   │               │
                         ▼                   ▼               ▼
                  ┌─────────────┐   ┌──────────────┐  ┌───────────┐
                  │ 3× 100% ✅  │   │ 5× retained  │  │ 3× discard│
                  │   Deploy!   │   │  Converged   │  │   Stop ⚠️  │
                  └─────────────┘   └──────────────┘  └───────────┘

Multi-Model Validation

  Iter 1 (Optimizer)          Iter 2 (Validator)         Iter 3 (Optimizer)
  ─────────────────           ──────────────────         ─────────────────

  ┌──────────────┐            ┌──────────────┐           ┌──────────────┐
  │ Claude Opus  │            │    GPT-5     │           │ Claude Opus  │
  │              │            │              │           │              │
  │ Analyze      │            │ Blind review │           │ Fix findings │
  │ Find issues  │            │ of output    │           │ from GPT-5   │
  │ Write fixes  │            │ (no context) │           │              │
  └──────┬───────┘            └──────┬───────┘           └──────┬───────┘
         │                           │                          │
         ▼                           ▼                          ▼
  pass_rate: 62%              pass_rate: 78%              pass_rate: 95%
  status: improved            status: improved            status: improved

         └────── TSV ──────────────── TSV ──────────────── TSV ──────┘

  Different model validates → no "grading your own homework" blind spot

Project Mode — Three Phases

  Phase 1                    Phase 2                      Phase 3
  SCAN & PLAN                CROSS-FILE ANALYSIS          ITERATIVE FIX LOOP
  ─────────────              ───────────────────          ──────────────────

  ┌──────────────┐           ┌──────────────────┐         ┌──────────────┐
  │ Walk repo    │           │ README ↔ CLI     │    ┌───▶│ Surgical fix │
  │ tree         │           │ Dockerfile ↔ deps│    │    │ across files │
  │              │──────────▶│ CI ↔ scripts     │───▶│    └──────┬───────┘
  │ Build file   │           │ .env ↔ code refs │    │           │
  │ priority map │           │ imports ↔ reqs   │    │           ▼
  └──────────────┘           │ .gitignore ↔ out │    │    ┌──────────────┐
                             └──────────────────┘    │    │ Validate     │
                                                     │    │ consistency  │
                                                     │    └──────┬───────┘
                                                     │           │
                                                     │     Not   │  Done?
                                                     └─── yet ◀──┘

🚀 Quick Start

Install

# Via ClawHub
clawhub install autoforge

# Or clone
git clone https://github.com/akrimm702/autoforge.git
cp -r autoforge ~/.openclaw/workspace/skills/autoforge

Configure reporting (optional)

export AF_CHANNEL="telegram"        # telegram | discord | slack
export AF_CHAT_ID="-100XXXXXXXXXX"  # chat/group ID
export AF_TOPIC_ID="1234"           # thread ID (optional)

No env vars? Reports print to stdout with ANSI colors.

Tell your agent

Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY handling correct? Workspace protection enforced? Clear structure?

The agent reads the skill, runs the loop, tracks everything in TSV, reports live, and stops when convergence math says it's done.

🔧 Four Modes

`prompt` — Mental Simulation

Simulates 5 realistic scenarios per iteration, evaluates Yes/No against defined evals, calculates pass rate mathematically. No code execution.

Best for: SKILL.md files, prompt engineering, documentation, briefing templates.

`code` — Real Execution

Runs code in a sandbox, measures exit codes, stdout, stderr, runtime. Evaluates against concrete test criteria.

Best for: Shell scripts, Python tools, data pipelines, build systems.

`audit` — CLI Testing

Tests documented commands against actual CLI behavior (--help, read-only). Catches docs-vs-reality drift. Two variants: Simple (2 iterations) or Deep (iterative with multi-model).

Best for: Verifying skill documentation matches real CLI behavior.

`project` — Whole Repository ⭐

Scans an entire repo, builds a file-map with priorities, runs cross-file consistency checks, and iteratively fixes issues across multiple files.

Best for: README ↔ CLI drift, Dockerfile ↔ dependency mismatches, CI ↔ project structure gaps.

Cross-file checks include:

README documents what the CLI actually does
Dockerfile installs the right dependency versions
CI workflows reference correct paths and scripts
.env.example covers all env vars used in code
Every import has a matching dependency declaration
.gitignore excludes build artifacts and secrets

📊 Live Reporting

After each iteration, report.sh sends live updates:

📊 AutoForge: coding-agent

📍 Iter 1   █████████░░░░░░░░░░░  45%
✅ Iter 2   ████████████░░░░░░░░  62%
✅ Iter 3   ███████████████░░░░░  78%
✅ Iter 4   █████████████████░░░  85%
✅ Iter 5   ██████████████████░░  90%
✅ Iter 6   ██████████████████░░  92%
✅ Iter 7   ██████████████████░░  92%
✅ Iter 8   ███████████████████░  95%
✅ Iter 9   ███████████████████░  95%
✅ Iter 10  ████████████████████  100%

──────────────────────
Iterations: 10  ✅ Keep: 10  ❌ Discard: 0
🏆 Best pass rate: 100% (Iter 10)

✅ Loop converged — improvement found

Every iteration is tracked in TSV:

iteration  prompt_version_summary       pass_rate  change_description              status
1          Baseline                     45%        Original SKILL.md baseline      baseline
2          Add missing subcommands      62%        16 Codex subcommands added      improved
3          Fix approval flags           78%        Scoped flags by context         improved
...
10         Validation pass              100%       All checks green                improved

📐 Convergence Rules

No vibes. No "looks good." Mathematical stop conditions:

Condition	Rule	Purpose
⬇️ Minimum iters	Must reach N before any stop	Prevents premature convergence
🛑 Max 30 iters	Hard safety cap	Cost protection
❌ 3× discard streak	Stop + analyze	Detects structural problems
✅ 3× 100% pass	Confirmed perfect	After minimum reached
➡️ 5× retained streak	Fully converged	No further improvement possible

Validator noise detection: In multi-model setups, validators can produce false positives. AutoForge recognizes config/path confusion, inverted checks, normal English flagged as forbidden references, and over-counting. After all real fixes, if >3 discards stem from non-reproducible complaints → declare convergence.

🔀 Multi-Model Cross-Validation

For complex audits, split optimizer and validator across different models:

Role	Example Models	Task
Optimizer	Claude Opus, GPT-4.1	Finds issues, writes fixes
Validator	GPT-5, Gemini	Checks against ground truth independently

The validator doesn't see the optimizer's reasoning — just the output. This prevents the "same model validates its own work" blind spot.

🏆 Real-World Results

Production runs, not demos.

coding-agent SKILL.md — 553 lines rewritten across 10 iterations. 16 Codex subcommands + 40 Claude CLI flags documented. 45% → 100%. Discovered --yolo was never a real flag.

ACP Router — 90% → 100% in 9 iterations. Agent coverage doubled from 6 to 12 harnesses. Thread spawn recovery policy written from scratch.

Sub-Agents Documentation — 70% → 100% in 14 iterations with multi-model validation. 6 real bugs found in upstream docs. Identified 4 categories of validator false positives.

backup.sh — Added rsync support, validation checks, restore-test. Code mode with real execution. 3 iterations to stable, 2 more to polish.

AutoForge on itself 🤯 — Self-forged in project mode: 67% → 100% in 8 iterations. Fixed 2 script bugs, cleaned config/doc inconsistencies across 7 files.

🗂️ Directory Structure

autoforge/
├── SKILL.md                ← OpenClaw skill definition
├── README.md               ← You are here
├── LICENSE                 ← MIT
├── .gitignore
├── scripts/
│   ├── report.sh           ← Live reporting (channel or stdout)
│   └── visualize.py        ← PNG progress chart generator
├── references/
│   ├── eval-examples.md    ← 200+ pre-built evals by category
│   └── ml-mode.md          ← ML training integration guide
├── examples/
│   ├── demo-results.tsv    ← Sample iteration data
│   └── example-config.json ← Reference template
└── results/                ← Your run data (gitignored)
    └── .gitkeep

⚙️ Configuration

AutoForge is configured entirely via environment variables. No config file needed.

Variable	Default	Description
`AF_CHANNEL`	`telegram`	Report delivery channel
`AF_CHAT_ID`	(none)	Chat/group ID. Unset = stdout
`AF_TOPIC_ID`	(none)	Thread/topic ID

Flag	Behavior
`--dry-run` (default)	Only TSV + proposed files. Target unchanged.
`--live`	Overwrites target. Auto-backup to `results/backups/`.
`--resume`	Continue from existing TSV.

🤝 Contributing

Fork → feature branch → PR
Run shellcheck scripts/report.sh and python3 -m py_compile scripts/visualize.py
Include real-world results from your own runs if possible

Good contributions: new eval templates, additional channel support, bug fixes with repro steps, production run case studies.

License

MIT — see LICENSE.

Built by [Alexander Krimm](https://github.com/akrimm702). Battle-tested across **50+ iterations** on **6 production skills**. *Stop reflecting. Start forging.* 🔨

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.

autoforge

# Description

# SKILL.md

AutoForge — Top-Agent Architecture

Overview

Sub-Agent = You

Multi-Model Setup (recommended for Deep Audits)

Configuration

Hard Invariants

Modes — Read ONLY Your Mode!

TSV Format (same for ALL modes)

Header (once at loop start):

Row per iteration:

5 columns, TAB-separated, EXACTLY this order:

Escaping rules:

Status rules (based on pass-rate comparison):

Reporting (same for ALL modes)

Stop Conditions (for ALL modes)

Counting rules:

Recognizing Validator Noise

Execution Modes

mode: prompt

Per Iteration: What you do

At the End

mode: code

Per Iteration: What you do

Code Eval Types

At the End

mode: audit

Two Variants

Simple Audit Flow

Deep Audit Flow

Fixed Evals (audit)

mode: project

Three Phases

Phase 1: Scan & Plan

Phase 2: Cross-File Analysis

Phase 3: Iterative Fix Loop

Per Iteration: What you do

Dry-Run vs Live

Default Evals (auto-applied unless overridden)

Eval Scoring

At the End

Directory Structure

Examples (task descriptions, NOT CLI commands)

Eval Examples → Mode Mapping

Tips

# README.md

🏗️ Architecture

Core Loop

Multi-Model Validation

Project Mode — Three Phases

🚀 Quick Start

🔧 Four Modes

prompt — Mental Simulation

code — Real Execution

audit — CLI Testing

project — Whole Repository ⭐

📊 Live Reporting

📐 Convergence Rules

🔀 Multi-Model Cross-Validation

🏆 Real-World Results

🗂️ Directory Structure

⚙️ Configuration

🤝 Contributing

License

# Related Skills

# Supported AI Coding Agents

Confirm

Submit a Skill

`prompt` — Mental Simulation

`code` — Real Execution

`audit` — CLI Testing

`project` — Whole Repository ⭐