Install this skill from the multi-skill repository:

```bash
npx skills add imachiever/my_genai_skills_and_agents --skill "agent-harness-patterns"
```
# Description

Effective harness patterns for long-running autonomous agents: initialization, progress tracking, feature management, and testing strategies for production agent systems.
# SKILL.md

```yaml
---
name: agent-harness-patterns
description: |
  Effective harness patterns for long-running autonomous agents. Use when building
  Ralph-Loop style continuous development, agent evaluation frameworks, or multi-session
  agent workflows. Includes initialization, progress tracking, feature management, and
  testing strategies for production agent systems.
---
```
# Agent Harness Patterns

## Overview

This skill provides proven patterns for building effective harnesses for long-running autonomous agents, based on Anthropic's engineering research and the Ralph-Loop framework.

**Use when:**
- Building autonomous coding agents (Ralph-Loop style)
- Implementing multi-session agent workflows
- Creating agent evaluation frameworks (Harbor, Terminal-Bench)
- Need agents to work across multiple context windows
- Building production agent systems with progress tracking
## Core Problem: Context Window Bridging

**Challenge:** Agents struggle across multiple context windows because "each new session begins with no memory of what came before."

**Solution:** A two-part harness architecture with explicit environmental management.
## Architecture Pattern: Initializer + Coding Agent

### Part 1: Initializer Agent (First Run Only)
```python
# initializer_agent.py
"""
Sets up the foundational environment on the first run.
Creates infrastructure for long-running agent work.
"""
from datetime import datetime
from pathlib import Path
import subprocess
import json


def initialize_project():
    """Initialize project structure for a long-running agent."""
    # 1. Create init script
    init_script = """#!/bin/bash
# Auto-generated by Initializer Agent

# Start development server
npm run dev &
DEV_PID=$!

# Store PID for cleanup
echo $DEV_PID > .dev-server.pid
echo "✅ Development server started (PID: $DEV_PID)"
"""
    Path("init.sh").write_text(init_script)
    Path("init.sh").chmod(0o755)

    # 2. Create progress tracking file
    progress_doc = {
        "initialized_at": datetime.now().isoformat(),
        "sessions": [],
        "current_feature": None,
        "notes": "Initial setup complete"
    }
    Path("claude-progress.txt").write_text(json.dumps(progress_doc, indent=2))

    # 3. Create feature list (200+ granular features)
    features = create_feature_list()
    Path("features.json").write_text(json.dumps(features, indent=2))

    # 4. Initial git commit
    subprocess.run(["git", "add", "."], check=True)
    subprocess.run(
        ["git", "commit", "-m", "feat: Initialize agent harness environment"],
        check=True
    )

    print("✅ Project initialized for long-running agent work")


def create_feature_list():
    """Create the immutable feature list with 200+ granular features."""
    return {
        "version": "1.0.0",
        "features": [
            {
                "id": "feat-001",
                "name": "User authentication - Email/password",
                "status": "failing",
                "priority": "high",
                "estimated_sessions": 1
            },
            {
                "id": "feat-002",
                "name": "User authentication - Password reset flow",
                "status": "failing",
                "priority": "high",
                "estimated_sessions": 1
            },
            # ... 200+ more features
        ],
        "metadata": {
            "total_features": 200,
            "completed": 0,
            "in_progress": 0,
            "failing": 200
        }
    }


if __name__ == "__main__":
    initialize_project()
```
### Part 2: Coding Agent (Every Subsequent Session)
```python
# coding_agent_harness.py
"""
Every session starts with an initialization checklist, then focuses on
incremental progress: one feature per session.
"""
from datetime import datetime
import json
import subprocess
from pathlib import Path


class CodingAgentHarness:
    """Harness for coding agent sessions with environmental management."""

    def __init__(self):
        self.progress_file = Path("claude-progress.txt")
        self.features_file = Path("features.json")

    def session_initialization_checklist(self):
        """
        Run at the start of EVERY session.
        Saves tokens and quickly identifies broken states.
        """
        print("🔍 Session Initialization Checklist")

        # 1. Verify working directory access
        assert Path.cwd().exists(), "Working directory not accessible"
        print("✅ Working directory accessible")

        # 2. Read git logs
        git_log = subprocess.check_output(
            ["git", "log", "--oneline", "-10"],
            text=True
        )
        print(f"📜 Last 10 commits:\n{git_log}")

        # 3. Read progress documentation
        progress = json.loads(self.progress_file.read_text())
        print(f"📝 Sessions so far: {len(progress['sessions'])}")

        # 4. Select the highest-priority incomplete feature
        # (completion counts live in features.json, not the progress file)
        features = json.loads(self.features_file.read_text())
        meta = features['metadata']
        print(f"📊 Progress: {meta['completed']}/{meta['total_features']} features complete")
        next_feature = self._get_next_feature(features)
        if next_feature:
            print(f"🎯 Next feature: {next_feature['id']} - {next_feature['name']}")

        # 5. Run basic functionality tests
        self._verify_basic_functionality()
        print("✅ Basic functionality verified")

        return next_feature

    def _get_next_feature(self, features_data):
        """Select the highest-priority incomplete feature."""
        incomplete = [
            f for f in features_data['features']
            if f['status'] in ('failing', 'in_progress')
        ]
        # Sort by priority, then by feature id for a stable order
        priority_order = {'critical': 0, 'high': 1, 'medium': 2, 'low': 3}
        sorted_features = sorted(
            incomplete,
            key=lambda x: (priority_order.get(x['priority'], 99), x['id'])
        )
        return sorted_features[0] if sorted_features else None

    def _verify_basic_functionality(self):
        """Run smoke tests before implementing new features."""
        result = subprocess.run(
            ["npm", "test", "--", "--testPathPattern=smoke"],
            capture_output=True
        )
        if result.returncode != 0:
            print("⚠️ WARNING: Basic tests failing. Repair needed before new work.")
            raise RuntimeError("Broken state detected")

    def incremental_progress_pattern(self, feature_id: str):
        """
        Work on ONE feature per session with proper documentation.
        Enables clean git-based rollbacks.
        """
        # 1. Create feature branch
        subprocess.run(["git", "checkout", "-b", f"feat/{feature_id}"])

        # 2. Implement feature (agent work happens here)
        print(f"🔨 Working on {feature_id}...")

        # 3. Run end-to-end tests (CRITICAL - agents skip this without prompting)
        self._run_e2e_tests(feature_id)

        # 4. Update feature status
        self._update_feature_status(feature_id, "passing")

        # 5. Commit with a descriptive message
        subprocess.run(["git", "add", "."])
        subprocess.run([
            "git", "commit", "-m",
            f"feat({feature_id}): Implement feature\n\nTests: passing\nProgress: documented"
        ])

        # 6. Write progress summary
        self._write_progress_summary(feature_id)
        print(f"✅ Feature {feature_id} complete and committed")

    def _run_e2e_tests(self, feature_id: str):
        """
        Run end-to-end browser automation tests.

        CRITICAL: Explicit prompting for E2E testing dramatically improves quality.
        Agents tend to mark features complete without proper testing.
        """
        print(f"🧪 Running E2E tests for {feature_id}...")
        # Use Playwright/Puppeteer for browser automation
        result = subprocess.run(
            ["npx", "playwright", "test", f"--grep={feature_id}"],
            capture_output=True
        )
        if result.returncode != 0:
            print(f"❌ E2E tests failed for {feature_id}")
            print(result.stdout.decode())
            raise RuntimeError("E2E tests failed - feature incomplete")
        print(f"✅ E2E tests passed for {feature_id}")

    def _update_feature_status(self, feature_id: str, status: str):
        """Update a feature's status (status only; feature content is immutable)."""
        features = json.loads(self.features_file.read_text())
        for feature in features['features']:
            if feature['id'] == feature_id:
                feature['status'] = status
                feature['completed_at'] = datetime.now().isoformat()
                break
        # Update metadata counts
        features['metadata']['completed'] = sum(
            1 for f in features['features'] if f['status'] == 'passing'
        )
        features['metadata']['failing'] = sum(
            1 for f in features['features'] if f['status'] == 'failing'
        )
        self.features_file.write_text(json.dumps(features, indent=2))

    def _write_progress_summary(self, feature_id: str):
        """
        Write a progress summary that enables git-based rollbacks.
        Eliminates time spent guessing what happened previously.
        """
        progress = json.loads(self.progress_file.read_text())
        session_summary = {
            "session_id": len(progress['sessions']) + 1,
            "timestamp": datetime.now().isoformat(),
            "feature_id": feature_id,
            "status": "completed",
            "git_commit": subprocess.check_output(
                ["git", "rev-parse", "HEAD"],
                text=True
            ).strip(),
            "notes": f"Implemented {feature_id} with E2E tests"
        }
        progress['sessions'].append(session_summary)
        progress['current_feature'] = None
        self.progress_file.write_text(json.dumps(progress, indent=2))


# Usage in the agent prompt
harness = CodingAgentHarness()
next_feature = harness.session_initialization_checklist()
if next_feature:
    harness.incremental_progress_pattern(next_feature['id'])
```
## Ralph-Loop Pattern (Autonomous Continuous Development)

### What is Ralph-Loop?
Ralph-Loop is an autonomous development loop using Claude Code hooks to prevent the agent from exiting until a specific completion promise is met.
From `frankbria/ralph-claude-code`:

> "Ralph-Loop solves 'AI laziness' by using the Claude Code stop hook. Claude Code iteratively improves your project until completion, with built-in safeguards to prevent infinite loops and API overuse."
### Implementation

```bash
# Install Ralph-Loop via the official plugin
/plugin install ralph-wiggum@anthropic-plugins

# Ralph creates a user-prompt-submit hook that:
# 1. Automatically injects '/ralph-loop' as user input
# 2. While in the ralph-loop, re-injects '/continue' whenever Claude's turn ends
# 3. Exits only when the agent declares the completion promise met
```
**Key Features:**
- Autonomous iteration until task completion
- Built-in safeguards against infinite loops
- API cost tracking and limits
- Progress checkpoints for rollback
**When to use Ralph-Loop:**
- Multi-hour development tasks
- Refactoring large codebases
- Implementing complex features requiring iteration
- When you can monitor but not actively participate
**Safety Considerations:**
- Set max iteration limits (see the hook sketch below)
- Monitor API costs
- Use with feature lists (prevents premature completion)
- Test in non-production environments first
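
These safeguards can also be wired into a custom stop hook directly. Below is a minimal sketch that enforces an iteration cap and the feature-list completion promise. It assumes Claude Code's Stop-hook contract (hook input arrives as JSON on stdin; printing `{"decision": "block", ...}` keeps the session running); the counter file, cap, and `features.json` path are illustrative choices, not part of the Ralph plugin.

```python
#!/usr/bin/env python3
# stop_hook.py - sketch of a Ralph-style stop hook with safeguards.
# Assumes Claude Code's Stop hook contract: input as JSON on stdin;
# printing {"decision": "block", "reason": ...} keeps the session going.
import json
import sys
from pathlib import Path

MAX_ITERATIONS = 50  # hard cap against infinite loops / API overuse


def main():
    json.load(sys.stdin)  # hook payload (session id, etc.); unused here

    # Count iterations across hook invocations in a scratch file
    counter = Path(".ralph-iterations")
    count = int(counter.read_text()) + 1 if counter.exists() else 1
    counter.write_text(str(count))
    if count >= MAX_ITERATIONS:
        sys.exit(0)  # budget exhausted: allow the agent to stop

    # Completion promise: every feature in the list must be passing
    features = json.loads(Path("features.json").read_text())
    remaining = [f for f in features["features"] if f["status"] != "passing"]
    if not remaining:
        sys.exit(0)  # promise met: let the session end

    # Block the stop and tell the agent why
    print(json.dumps({
        "decision": "block",
        "reason": f"{len(remaining)} features still not passing. Continue with the next one."
    }))


if __name__ == "__main__":
    main()
```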
## Feature List Best Practices

### Structure (JSON Format)
```json
{
  "version": "1.0.0",
  "immutable": true,
  "features": [
    {
      "id": "feat-001",
      "name": "User authentication - Email/password",
      "status": "failing",
      "priority": "critical",
      "acceptance_criteria": [
        "User can register with email/password",
        "Password must meet security requirements (8+ chars, mixed case, number)",
        "Email verification sent after registration",
        "User can login with verified email",
        "Session persists across page refreshes"
      ],
      "estimated_sessions": 1,
      "dependencies": [],
      "tests_required": ["unit", "integration", "e2e"]
    }
  ],
  "metadata": {
    "total_features": 200,
    "completed": 0,
    "in_progress": 0,
    "failing": 200,
    "estimated_total_sessions": 150
  }
}
```
### Key Principles

- **200+ Granular Features** - Prevents agents from declaring victory prematurely
- **Immutable** - The file should only be updated for status changes, not content
- **Explicit Acceptance Criteria** - No ambiguity about "done"
- **Clear Dependencies** - Agents know what must be completed first (enforced in the selector sketch below)
- **Testing Requirements** - Forces proper test coverage
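
The `dependencies` field only helps if the harness enforces it. A possible extension of `_get_next_feature` from the harness above — a sketch, not part of the original code — skips features whose dependencies are not yet passing:

```python
def get_next_unblocked_feature(features_data):
    """Pick the highest-priority incomplete feature whose dependencies all pass.

    Sketch extending _get_next_feature to honor the `dependencies` field.
    """
    passing = {
        f["id"] for f in features_data["features"] if f["status"] == "passing"
    }
    unblocked = [
        f for f in features_data["features"]
        if f["status"] in ("failing", "in_progress")
        and all(dep in passing for dep in f.get("dependencies", []))
    ]
    # Same ordering as the harness: priority first, then stable id order
    priority_order = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    unblocked.sort(key=lambda f: (priority_order.get(f["priority"], 99), f["id"]))
    return unblocked[0] if unblocked else None
```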
## Testing Strategy (CRITICAL)

### The Problem

> "Agents tend to mark features complete without proper testing unless specifically instructed otherwise."

### The Solution: Explicit E2E Testing Prompts
````python
# Add to the agent instruction
TESTING_INSTRUCTION = """
CRITICAL TESTING REQUIREMENTS:

Before marking ANY feature complete, you MUST:
1. Write unit tests covering all new functions
2. Write integration tests for API endpoints
3. Write END-TO-END browser automation tests using Playwright

E2E Test Template:
```javascript
test('feat-XXX: [Feature name]', async ({ page }) => {
  // 1. Navigate to page
  await page.goto('http://localhost:3000/feature-page');

  // 2. Interact with UI
  await page.fill('[data-testid="input"]', 'test value');
  await page.click('[data-testid="submit-button"]');

  // 3. Verify expected behavior
  await expect(page.locator('[data-testid="result"]')).toContainText('expected result');

  // 4. Verify side effects (API calls, database, etc.)
  const apiResponse = await page.waitForResponse(resp => resp.url().includes('/api/endpoint'));
  expect(apiResponse.status()).toBe(200);
});
```

- Run ALL tests: npm test && npm run test:e2e
- Verify tests PASS before committing
- Document test results in commit message

DO NOT claim a feature is complete without passing tests.
"""
````
---
## Agent Evaluation Frameworks
### Harbor Framework
**Purpose:** Evaluate agents in containerized environments at scale.
From [Harbor Framework](https://harborframework.com):
- Standardized format for defining tasks and graders
- Infrastructure for running trials across cloud providers
- Evaluates Claude Code, OpenHands, Codex CLI, and more
**Key Concepts:**
- **Task:** Single test with defined inputs and success criteria
- **Trial:** Each attempt at a task
- **Grader:** Logic that scores agent performance
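
To make the Task/Trial/Grader vocabulary concrete, here is an illustrative sketch of how the three relate. These types are hypothetical and are NOT Harbor's actual API:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative only - sketches the Task/Trial/Grader relationship
# described above; not Harbor's actual API.


@dataclass
class Task:
    task_id: str
    instructions: str       # what the agent is asked to do
    success_criteria: str   # what the grader checks


@dataclass
class Trial:
    task: Task
    agent_name: str         # e.g. "claude-code", "openhands"
    transcript: str = ""    # what the agent actually did


GraderFn = Callable[[Trial], float]  # returns a score, e.g. 0.0-1.0


def run_trials(task: Task, agents: list[str], grader: GraderFn) -> dict[str, float]:
    """Run one trial per agent on a task and score each with the grader."""
    scores = {}
    for agent in agents:
        trial = Trial(task=task, agent_name=agent)
        # ... execute the agent in a container, capture its transcript ...
        scores[agent] = grader(trial)
    return scores
```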
### Bloom (Behavioral Evaluations)
**Purpose:** Agentic framework for developing behavioral evaluations.
From [Anthropic Bloom](https://alignment.anthropic.com/2025/bloom-auto-evals/):
- Takes a researcher-specified behavior
- Quantifies its frequency and severity
- Automatically generates scenarios
- Reproducible and targeted evaluations
**Use cases:**
- Alignment testing
- Safety evaluations
- Behavioral auditing
---
## Progress Tracking Template
````markdown
# Claude Progress Log
## Project: [Project Name]
**Started:** 2026-01-25
**Last Updated:** [Auto-updated each session]
---
## Current Status
- **Total Features:** 200
- **Completed:** 15 (7.5%)
- **In Progress:** 1
- **Failing:** 184
**Current Feature:** feat-016 - User authentication - OAuth integration
---
## Session History
### Session 12 (2026-01-25 16:30)
**Feature:** feat-015 - User authentication - Password reset
**Status:** ✅ Completed
**Git Commit:** abc123f
**Tests:**
- Unit: ✅ 5/5 passing
- Integration: ✅ 3/3 passing
- E2E: ✅ 2/2 passing
**Notes:** Implemented password reset flow with email verification. All tests passing.
### Session 11 (2026-01-25 14:15)
**Feature:** feat-014 - User authentication - Email/password
**Status:** ✅ Completed
**Git Commit:** def456a
**Tests:**
- Unit: ✅ 8/8 passing
- Integration: ✅ 4/4 passing
- E2E: ✅ 3/3 passing
**Notes:** Basic auth working. Session persistence implemented.
---
## Blockers / Issues
- None currently
## Next Steps
1. Complete feat-016 (OAuth integration)
2. Begin feat-017 (User profile management)
---
## Git History Summary
```bash
abc123f feat(feat-015): Implement password reset flow
def456a feat(feat-014): Implement email/password authentication
...
```
````

---
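
Since the harness stores progress as JSON in `claude-progress.txt`, a small renderer can regenerate a log like the template above each session. A sketch, assuming the session fields written by `_write_progress_summary`:

```python
import json
from pathlib import Path


def render_progress_log(progress_path: str = "claude-progress.txt") -> str:
    """Render claude-progress.txt (JSON) as a markdown session history.

    Sketch assuming the fields written by _write_progress_summary above.
    """
    progress = json.loads(Path(progress_path).read_text())
    lines = ["# Claude Progress Log", "", "## Session History", ""]
    for s in reversed(progress["sessions"]):  # newest first
        lines += [
            f"### Session {s['session_id']} ({s['timestamp']})",
            f"**Feature:** {s['feature_id']}",
            f"**Status:** {s['status']}",
            f"**Git Commit:** {s['git_commit'][:7]}",
            f"**Notes:** {s['notes']}",
            "",
        ]
    return "\n".join(lines)


# Example: Path("PROGRESS.md").write_text(render_progress_log())
```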
## Prompt Templates for Long-Running Agents
### Initializer Agent Prompt
```
You are an Initializer Agent for a long-running autonomous coding project.
Your ONLY job is to set up the foundational environment:
- Create init.sh script for running the development server
- Create claude-progress.txt for tracking agent work
- Create features.json with 200+ granular features (all marked "failing")
- Make an initial git commit documenting all added files
After initialization, you will hand off to the Coding Agent for all future sessions.
DO NOT start implementing features. ONLY set up the infrastructure.
```
### Coding Agent Prompt
```
You are a Coding Agent working on a long-running project.
EVERY SESSION starts with this checklist:
1. ✅ Verify working directory access
2. ✅ Read git logs (last 10 commits)
3. ✅ Read claude-progress.txt
4. ✅ Select highest-priority incomplete feature from features.json
5. ✅ Run basic functionality tests (npm test -- --testPathPattern=smoke)
Then follow the INCREMENTAL PROGRESS PATTERN:
1. Work on ONE feature per session
2. Write unit + integration + E2E tests (REQUIRED)
3. Run ALL tests - must pass before committing
4. Commit with descriptive message including test results
5. Update features.json status
6. Write session summary to claude-progress.txt
CRITICAL RULES:
- DO NOT declare a feature complete without passing E2E tests
- DO NOT work on multiple features in one session
- DO NOT skip writing progress summaries
- DO use git for rollback if anything breaks
Your goal: Work through ALL 200+ features systematically, one session at a time.
```
## Sources & References

### Official Anthropic Resources
- Effective Harnesses for Long-Running Agents - Core harness patterns
- Demystifying Evals for AI Agents - Agent evaluation guide
- [Bloom: Automated Behavioral Evaluations](https://alignment.anthropic.com/2025/bloom-auto-evals/) - Behavioral testing framework
- Building and Evaluating Auditing Agents - Alignment auditing

### Community Implementations
- ralph-claude-code by frankbria - Autonomous AI development loop
- Ralph Loop Guide - 24/7 autonomous development

### Evaluation Frameworks
- [Harbor Framework](https://harborframework.com) - Agent evaluation at scale
- Terminal-Bench 2.0 & Harbor - Standardized agent testing
## When to Use This Skill

✅ **Use when:**
- Building Ralph-Loop style continuous development workflows
- Creating multi-session agent projects (>10 sessions)
- Need agents to maintain context across days/weeks
- Implementing agent evaluation frameworks
- Building production autonomous agent systems
- Migrating from manual to agentic development workflows
❌ **Don't use when:**
- Single-session tasks (use regular Claude Code)
- Tasks completed in <1 hour
- Exploratory/research work without clear deliverables
- When you need tight control over every decision
---

**Version:** 1.0.0
**Last Updated:** January 25, 2026
**Maintained by:** Rajat Bhatia