Install this skill from the multi-skill repository:

```bash
npx skills add imachiever/my_genai_skills_and_agents --skill "agent-harness-patterns"
```
# Description

Effective harness patterns for long-running autonomous agents: initialization, progress tracking, feature management, and testing strategies for production agent systems.
# SKILL.md

```yaml
---
name: agent-harness-patterns
description: |
  Effective harness patterns for long-running autonomous agents. Use when building
  Ralph-Loop style continuous development, agent evaluation frameworks, or multi-session
  agent workflows. Includes initialization, progress tracking, feature management, and
  testing strategies for production agent systems.
---
```
# Agent Harness Patterns

## Overview

This skill provides proven patterns for building effective harnesses for long-running autonomous agents, based on Anthropic's engineering research and the Ralph-Loop framework.

**Use when:**
- Building autonomous coding agents (Ralph-Loop style)
- Implementing multi-session agent workflows
- Creating agent evaluation frameworks (Harbor, Terminal-Bench)
- Need agents to work across multiple context windows
- Building production agent systems with progress tracking
## Core Problem: Context Window Bridging

**Challenge:** Agents struggle across multiple context windows because "each new session begins with no memory of what came before."

**Solution:** A two-part harness architecture with explicit environmental management.
## Architecture Pattern: Initializer + Coding Agent

### Part 1: Initializer Agent (First Run Only)
```python
# initializer_agent.py
"""
Sets up the foundational environment on the first run.
Creates infrastructure for long-running agent work.
"""
from datetime import datetime
from pathlib import Path
import subprocess
import json


def initialize_project():
    """Initialize project structure for a long-running agent."""
    # 1. Create init script
    init_script = """#!/bin/bash
# Auto-generated by Initializer Agent

# Start development server
npm run dev &
DEV_PID=$!

# Store PID for cleanup
echo $DEV_PID > .dev-server.pid
echo "✅ Development server started (PID: $DEV_PID)"
"""
    Path("init.sh").write_text(init_script)
    Path("init.sh").chmod(0o755)

    # 2. Create progress tracking file
    progress_doc = {
        "initialized_at": datetime.now().isoformat(),
        "sessions": [],
        "current_feature": None,
        "notes": "Initial setup complete"
    }
    Path("claude-progress.txt").write_text(json.dumps(progress_doc, indent=2))

    # 3. Create feature list (200+ granular features)
    features = create_feature_list()
    Path("features.json").write_text(json.dumps(features, indent=2))

    # 4. Initial git commit
    subprocess.run(["git", "add", "."], check=True)
    subprocess.run(
        ["git", "commit", "-m", "feat: Initialize agent harness environment"],
        check=True
    )

    print("✅ Project initialized for long-running agent work")


def create_feature_list():
    """Create the immutable feature list with 200+ granular features."""
    return {
        "version": "1.0.0",
        "features": [
            {
                "id": "feat-001",
                "name": "User authentication - Email/password",
                "status": "failing",
                "priority": "high",
                "estimated_sessions": 1
            },
            {
                "id": "feat-002",
                "name": "User authentication - Password reset flow",
                "status": "failing",
                "priority": "high",
                "estimated_sessions": 1
            },
            # ... 200+ more features
        ],
        "metadata": {
            "total_features": 200,
            "completed": 0,
            "in_progress": 0,
            "failing": 200
        }
    }


if __name__ == "__main__":
    initialize_project()
```
### Part 2: Coding Agent (Every Subsequent Session)
```python
# coding_agent_harness.py
"""
Every session starts with an initialization checklist, then focuses on
incremental progress: one feature per session.
"""
from datetime import datetime
import json
import subprocess
from pathlib import Path


class CodingAgentHarness:
    """Harness for coding agent sessions with environmental management."""

    def __init__(self):
        self.progress_file = Path("claude-progress.txt")
        self.features_file = Path("features.json")

    def session_initialization_checklist(self):
        """
        Run at the start of EVERY session.
        Saves tokens and quickly identifies broken states.
        """
        print("🔍 Session Initialization Checklist")

        # 1. Verify working directory access
        assert Path.cwd().exists(), "Working directory not accessible"
        print("✅ Working directory accessible")

        # 2. Read git logs
        git_log = subprocess.check_output(
            ["git", "log", "--oneline", "-10"],
            text=True
        )
        print(f"📜 Last 10 commits:\n{git_log}")

        # 3. Read progress documentation
        progress = json.loads(self.progress_file.read_text())
        print(f"📝 Sessions so far: {len(progress['sessions'])}")

        # 4. Select the highest-priority incomplete feature
        # (completion counts live in features.json, not the progress file)
        features = json.loads(self.features_file.read_text())
        meta = features['metadata']
        print(f"📊 Progress: {meta['completed']}/{meta['total_features']} features complete")
        next_feature = self._get_next_feature(features)
        if next_feature:
            print(f"🎯 Next feature: {next_feature['id']} - {next_feature['name']}")

        # 5. Run basic functionality tests
        self._verify_basic_functionality()
        print("✅ Basic functionality verified")

        return next_feature

    def _get_next_feature(self, features_data):
        """Select the highest-priority incomplete feature."""
        incomplete = [
            f for f in features_data['features']
            if f['status'] in ('failing', 'in_progress')
        ]
        # Sort by priority, then by feature id for a stable order
        priority_order = {'critical': 0, 'high': 1, 'medium': 2, 'low': 3}
        sorted_features = sorted(
            incomplete,
            key=lambda x: (priority_order.get(x['priority'], 99), x['id'])
        )
        return sorted_features[0] if sorted_features else None

    def _verify_basic_functionality(self):
        """Run smoke tests before implementing new features."""
        result = subprocess.run(
            ["npm", "test", "--", "--testPathPattern=smoke"],
            capture_output=True
        )
        if result.returncode != 0:
            print("⚠️ WARNING: Basic tests failing. Repair needed before new work.")
            raise RuntimeError("Broken state detected")

    def incremental_progress_pattern(self, feature_id: str):
        """
        Work on ONE feature per session with proper documentation.
        Enables clean git-based rollbacks.
        """
        # 1. Create feature branch
        subprocess.run(["git", "checkout", "-b", f"feat/{feature_id}"])

        # 2. Implement feature (agent work happens here)
        print(f"🔨 Working on {feature_id}...")

        # 3. Run end-to-end tests (CRITICAL - agents skip this without prompting)
        self._run_e2e_tests(feature_id)

        # 4. Update feature status
        self._update_feature_status(feature_id, "passing")

        # 5. Commit with a descriptive message
        subprocess.run(["git", "add", "."])
        subprocess.run([
            "git", "commit", "-m",
            f"feat({feature_id}): Implement feature\n\nTests: passing\nProgress: documented"
        ])

        # 6. Write progress summary
        self._write_progress_summary(feature_id)
        print(f"✅ Feature {feature_id} complete and committed")

    def _run_e2e_tests(self, feature_id: str):
        """
        Run end-to-end browser automation tests.

        CRITICAL: Explicit prompting for E2E testing dramatically improves quality.
        Agents tend to mark features complete without proper testing.
        """
        print(f"🧪 Running E2E tests for {feature_id}...")
        # Use Playwright/Puppeteer for browser automation
        result = subprocess.run(
            ["npx", "playwright", "test", f"--grep={feature_id}"],
            capture_output=True
        )
        if result.returncode != 0:
            print(f"❌ E2E tests failed for {feature_id}")
            print(result.stdout.decode())
            raise RuntimeError("E2E tests failed - feature incomplete")
        print(f"✅ E2E tests passed for {feature_id}")

    def _update_feature_status(self, feature_id: str, status: str):
        """Update a feature's status (status only; feature content is immutable)."""
        features = json.loads(self.features_file.read_text())
        for feature in features['features']:
            if feature['id'] == feature_id:
                feature['status'] = status
                feature['completed_at'] = datetime.now().isoformat()
                break
        # Update metadata counts
        features['metadata']['completed'] = sum(
            1 for f in features['features'] if f['status'] == 'passing'
        )
        features['metadata']['failing'] = sum(
            1 for f in features['features'] if f['status'] == 'failing'
        )
        self.features_file.write_text(json.dumps(features, indent=2))

    def _write_progress_summary(self, feature_id: str):
        """
        Write a progress summary that enables git-based rollbacks.
        Eliminates time spent guessing what happened previously.
        """
        progress = json.loads(self.progress_file.read_text())
        session_summary = {
            "session_id": len(progress['sessions']) + 1,
            "timestamp": datetime.now().isoformat(),
            "feature_id": feature_id,
            "status": "completed",
            "git_commit": subprocess.check_output(
                ["git", "rev-parse", "HEAD"],
                text=True
            ).strip(),
            "notes": f"Implemented {feature_id} with E2E tests"
        }
        progress['sessions'].append(session_summary)
        progress['current_feature'] = None
        self.progress_file.write_text(json.dumps(progress, indent=2))


# Usage in the agent prompt
harness = CodingAgentHarness()
next_feature = harness.session_initialization_checklist()
if next_feature:
    harness.incremental_progress_pattern(next_feature['id'])
```
## Ralph-Loop Pattern (Autonomous Continuous Development)

### What is Ralph-Loop?
Ralph-Loop is an autonomous development loop using Claude Code hooks to prevent the agent from exiting until a specific completion promise is met.
From `frankbria/ralph-claude-code`:

> "Ralph-Loop solves 'AI laziness' by using the Claude Code stop hook. Claude Code iteratively improves your project until completion, with built-in safeguards to prevent infinite loops and API overuse."
### Implementation

```bash
# Install Ralph-Loop via the official plugin
/plugin install ralph-wiggum@anthropic-plugins

# Ralph creates a user-prompt-submit hook that:
# 1. Automatically injects '/ralph-loop' as user input
# 2. While in the ralph-loop, re-injects '/continue' whenever Claude's turn ends
# 3. Exits only when the agent declares the completion promise met
```
**Key Features:**
- Autonomous iteration until task completion
- Built-in safeguards against infinite loops
- API cost tracking and limits
- Progress checkpoints for rollback
**When to use Ralph-Loop:**
- Multi-hour development tasks
- Refactoring large codebases
- Implementing complex features requiring iteration
- When you can monitor but not actively participate
**Safety Considerations:**
- Set max iteration limits (see the hook sketch below)
- Monitor API costs
- Use with feature lists (prevents premature completion)
- Test in non-production environments first
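
These safeguards can also be wired into a custom stop hook directly. Below is a minimal sketch that enforces an iteration cap and the feature-list completion promise. It assumes Claude Code's Stop-hook contract (hook input arrives as JSON on stdin; printing `{"decision": "block", ...}` keeps the session running); the counter file, cap, and `features.json` path are illustrative choices, not part of the Ralph plugin.

```python
#!/usr/bin/env python3
# stop_hook.py - sketch of a Ralph-style stop hook with safeguards.
# Assumes Claude Code's Stop hook contract: input as JSON on stdin;
# printing {"decision": "block", "reason": ...} keeps the session going.
import json
import sys
from pathlib import Path

MAX_ITERATIONS = 50  # hard cap against infinite loops / API overuse


def main():
    json.load(sys.stdin)  # hook payload (session id, etc.); unused here

    # Count iterations across hook invocations in a scratch file
    counter = Path(".ralph-iterations")
    count = int(counter.read_text()) + 1 if counter.exists() else 1
    counter.write_text(str(count))
    if count >= MAX_ITERATIONS:
        sys.exit(0)  # budget exhausted: allow the agent to stop

    # Completion promise: every feature in the list must be passing
    features = json.loads(Path("features.json").read_text())
    remaining = [f for f in features["features"] if f["status"] != "passing"]
    if not remaining:
        sys.exit(0)  # promise met: let the session end

    # Block the stop and tell the agent why
    print(json.dumps({
        "decision": "block",
        "reason": f"{len(remaining)} features still not passing. Continue with the next one."
    }))


if __name__ == "__main__":
    main()
```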
## Feature List Best Practices

### Structure (JSON Format)
```json
{
  "version": "1.0.0",
  "immutable": true,
  "features": [
    {
      "id": "feat-001",
      "name": "User authentication - Email/password",
      "status": "failing",
      "priority": "critical",
      "acceptance_criteria": [
        "User can register with email/password",
        "Password must meet security requirements (8+ chars, mixed case, number)",
        "Email verification sent after registration",
        "User can login with verified email",
        "Session persists across page refreshes"
      ],
      "estimated_sessions": 1,
      "dependencies": [],
      "tests_required": ["unit", "integration", "e2e"]
    }
  ],
  "metadata": {
    "total_features": 200,
    "completed": 0,
    "in_progress": 0,
    "failing": 200,
    "estimated_total_sessions": 150
  }
}
```
### Key Principles

- **200+ Granular Features** - Prevents agents from declaring victory prematurely
- **Immutable** - The file should only be updated for status changes, not content
- **Explicit Acceptance Criteria** - No ambiguity about "done"
- **Clear Dependencies** - Agents know what must be completed first (enforced in the selector sketch below)
- **Testing Requirements** - Forces proper test coverage
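
The `dependencies` field only helps if the harness enforces it. A possible extension of `_get_next_feature` from the harness above — a sketch, not part of the original code — skips features whose dependencies are not yet passing:

```python
def get_next_unblocked_feature(features_data):
    """Pick the highest-priority incomplete feature whose dependencies all pass.

    Sketch extending _get_next_feature to honor the `dependencies` field.
    """
    passing = {
        f["id"] for f in features_data["features"] if f["status"] == "passing"
    }
    unblocked = [
        f for f in features_data["features"]
        if f["status"] in ("failing", "in_progress")
        and all(dep in passing for dep in f.get("dependencies", []))
    ]
    # Same ordering as the harness: priority first, then stable id order
    priority_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    unblocked.sort(key=lambda f: (priority_order.get(f["priority"], 99), f["id"]))
    return unblocked[0] if unblocked else None
```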
## Testing Strategy (CRITICAL)

### The Problem

> "Agents tend to mark features complete without proper testing unless specifically instructed otherwise."

### The Solution: Explicit E2E Testing Prompts
````python
# Add to the agent instruction
TESTING_INSTRUCTION = """
CRITICAL TESTING REQUIREMENTS:

Before marking ANY feature complete, you MUST:
1. Write unit tests covering all new functions
2. Write integration tests for API endpoints
3. Write END-TO-END browser automation tests using Playwright

E2E Test Template:
```javascript
test('feat-XXX: [Feature name]', async ({ page }) => {
  // 1. Navigate to page
  await page.goto('http://localhost:3000/feature-page');

  // 2. Interact with UI
  await page.fill('[data-testid="input"]', 'test value');
  await page.click('[data-testid="submit-button"]');

  // 3. Verify expected behavior
  await expect(page.locator('[data-testid="result"]')).toContainText('expected result');

  // 4. Verify side effects (API calls, database, etc.)
  const apiResponse = await page.waitForResponse(resp => resp.url().includes('/api/endpoint'));
  expect(apiResponse.status()).toBe(200);
});
```

- Run ALL tests: npm test && npm run test:e2e
- Verify tests PASS before committing
- Document test results in commit message

DO NOT claim a feature is complete without passing tests.
"""
````
---
## Agent Evaluation Frameworks
### Harbor Framework
**Purpose:** Evaluate agents in containerized environments at scale.
From [Harbor Framework](https://harborframework.com):
- Standardized format for defining tasks and graders
- Infrastructure for running trials across cloud providers
- Evaluates Claude Code, OpenHands, Codex CLI, and more
**Key Concepts:**
- **Task:** Single test with defined inputs and success criteria
- **Trial:** Each attempt at a task
- **Grader:** Logic that scores agent performance
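
To make the Task/Trial/Grader vocabulary concrete, here is an illustrative sketch of how the three relate. These types are hypothetical and are NOT Harbor's actual API:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative only - sketches the Task/Trial/Grader relationship
# described above; not Harbor's actual API.


@dataclass
class Task:
    task_id: str
    instructions: str       # what the agent is asked to do
    success_criteria: str   # what the grader checks


@dataclass
class Trial:
    task: Task
    agent_name: str         # e.g. "claude-code", "openhands"
    transcript: str = ""    # what the agent actually did


GraderFn = Callable[[Trial], float]  # returns a score, e.g. 0.0-1.0


def run_trials(task: Task, agents: list[str], grader: GraderFn) -> dict[str, float]:
    """Run one trial per agent on a task and score each with the grader."""
    scores = {}
    for agent in agents:
        trial = Trial(task=task, agent_name=agent)
        # ... execute the agent in a container, capture its transcript ...
        scores[agent] = grader(trial)
    return scores
```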
### Bloom (Behavioral Evaluations)
**Purpose:** Agentic framework for developing behavioral evaluations.
From [Anthropic Bloom](https://alignment.anthropic.com/2025/bloom-auto-evals/):
- Takes a researcher-specified behavior
- Quantifies its frequency and severity
- Automatically generates scenarios
- Reproducible and targeted evaluations
**Use cases:**
- Alignment testing
- Safety evaluations
- Behavioral auditing
---
## Progress Tracking Template
````markdown
# Claude Progress Log
## Project: [Project Name]
**Started:** 2026-01-25
**Last Updated:** [Auto-updated each session]
---
## Current Status
- **Total Features:** 200
- **Completed:** 15 (7.5%)
- **In Progress:** 1
- **Failing:** 184
**Current Feature:** feat-016 - User authentication - OAuth integration
---
## Session History
### Session 12 (2026-01-25 16:30)
**Feature:** feat-015 - User authentication - Password reset
**Status:** ✅ Completed
**Git Commit:** abc123f
**Tests:**
- Unit: ✅ 5/5 passing
- Integration: ✅ 3/3 passing
- E2E: ✅ 2/2 passing
**Notes:** Implemented password reset flow with email verification. All tests passing.
### Session 11 (2026-01-25 14:15)
**Feature:** feat-014 - User authentication - Email/password
**Status:** ✅ Completed
**Git Commit:** def456a
**Tests:**
- Unit: ✅ 8/8 passing
- Integration: ✅ 4/4 passing
- E2E: ✅ 3/3 passing
**Notes:** Basic auth working. Session persistence implemented.
---
## Blockers / Issues
- None currently
## Next Steps
1. Complete feat-016 (OAuth integration)
2. Begin feat-017 (User profile management)
---
## Git History Summary
```bash
abc123f feat(feat-015): Implement password reset flow
def456a feat(feat-014): Implement email/password authentication
...
```
````

---
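
Since the harness stores progress as JSON in `claude-progress.txt`, a small renderer can regenerate a log like the template above each session. A sketch, assuming the session fields written by `_write_progress_summary`:

```python
import json
from pathlib import Path


def render_progress_log(progress_path: str = "claude-progress.txt") -> str:
    """Render claude-progress.txt (JSON) as a markdown session history.

    Sketch assuming the fields written by _write_progress_summary above.
    """
    progress = json.loads(Path(progress_path).read_text())
    lines = ["# Claude Progress Log", "", "## Session History", ""]
    for s in reversed(progress["sessions"]):  # newest first
        lines += [
            f"### Session {s['session_id']} ({s['timestamp']})",
            f"**Feature:** {s['feature_id']}",
            f"**Status:** {s['status']}",
            f"**Git Commit:** {s['git_commit'][:7]}",
            f"**Notes:** {s['notes']}",
            "",
        ]
    return "\n".join(lines)


# Example: Path("PROGRESS.md").write_text(render_progress_log())
```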
## Prompt Templates for Long-Running Agents
### Initializer Agent Prompt
```
You are an Initializer Agent for a long-running autonomous coding project.
Your ONLY job is to set up the foundational environment:
- Create init.sh script for running the development server
- Create claude-progress.txt for tracking agent work
- Create features.json with 200+ granular features (all marked "failing")
- Make an initial git commit documenting all added files
After initialization, you will hand off to the Coding Agent for all future sessions.
DO NOT start implementing features. ONLY set up the infrastructure.
```
### Coding Agent Prompt
```
You are a Coding Agent working on a long-running project.
EVERY SESSION starts with this checklist:
1. ✅ Verify working directory access
2. ✅ Read git logs (last 10 commits)
3. ✅ Read claude-progress.txt
4. ✅ Select highest-priority incomplete feature from features.json
5. ✅ Run basic functionality tests (npm test -- --testPathPattern=smoke)
Then follow the INCREMENTAL PROGRESS PATTERN:
1. Work on ONE feature per session
2. Write unit + integration + E2E tests (REQUIRED)
3. Run ALL tests - must pass before committing
4. Commit with descriptive message including test results
5. Update features.json status
6. Write session summary to claude-progress.txt
CRITICAL RULES:
- DO NOT declare a feature complete without passing E2E tests
- DO NOT work on multiple features in one session
- DO NOT skip writing progress summaries
- DO use git for rollback if anything breaks
Your goal: Work through ALL 200+ features systematically, one session at a time.
```
## Sources & References

### Official Anthropic Resources
- Effective Harnesses for Long-Running Agents - Core harness patterns
- Demystifying Evals for AI Agents - Agent evaluation guide
- [Bloom: Automated Behavioral Evaluations](https://alignment.anthropic.com/2025/bloom-auto-evals/) - Behavioral testing framework
- Building and Evaluating Auditing Agents - Alignment auditing

### Community Implementations
- ralph-claude-code by frankbria - Autonomous AI development loop
- Ralph Loop Guide - 24/7 autonomous development

### Evaluation Frameworks
- [Harbor Framework](https://harborframework.com) - Agent evaluation at scale
- Terminal-Bench 2.0 & Harbor - Standardized agent testing
## When to Use This Skill

✅ **Use when:**
- Building Ralph-Loop style continuous development workflows
- Creating multi-session agent projects (>10 sessions)
- Need agents to maintain context across days/weeks
- Implementing agent evaluation frameworks
- Building production autonomous agent systems
- Migrating from manual to agentic development workflows
❌ **Don't use when:**
- Single-session tasks (use regular Claude Code)
- Tasks completed in <1 hour
- Exploratory/research work without clear deliverables
- When you need tight control over every decision
---

**Version:** 1.0.0
**Last Updated:** January 25, 2026
**Maintained by:** Rajat Bhatia