Install this skill from the multi-skill repository:
npx skills add grahama1970/agent-skills --skill "battle"
# SKILL.md
name: battle
description: >
  Red vs Blue team security competition orchestrator. Runs long-running overnight
  battles with 1000s of interactions, scoring, and insight generation.
allowed-tools:
  - Bash
  - Read
triggers:
  - battle
  - thunderdome
  - red vs blue
  - overnight battle
  - security competition
  - red team vs blue team
metadata:
  short-description: Red vs Blue team security competition
requires: docker
Battle Skill
Red vs Blue Team Security Competition Orchestrator
Pits a Red Team (attack) against a Blue Team (defense) in a long-running competitive loop. Each team leverages all .pi/skills to attack or defend a target codebase.
Architecture
Based on research into the RvB framework, DARPA AIxCC, and Microsoft PyRIT:
┌─────────────────────────────────────────────────────────┐
│                   Battle Orchestrator                   │
│  - Game loop (RvB pattern)                              │
│  - Concurrent Red/Blue execution                        │
│  - Entropy-driven termination                           │
│  - Checkpointing for overnight runs                     │
└─────────────────────────────────────────────────────────┘
               │                           │
         ┌─────┴─────┐               ┌─────┴─────┐
         │ Red Team  │               │ Blue Team │
         │ (Thread)  │               │ (Thread)  │
         ├───────────┤               ├───────────┤
         │ Skills:   │               │ Skills:   │
         │ - hack    │               │ - anvil   │
         │ - memory  │               │ - memory  │
         └─────┬─────┘               └─────┬─────┘
               │                           │
               └─────────────┬─────────────┘
                             │
          ┌──────────────────┴──────────────────┐
          │            Digital Twin             │
          │  ┌───────────────────────────────┐  │
          │  │ Mode: git_worktree            │  │
          │  │  - Red attacks arena          │  │
          │  │  - Blue patches workspace     │  │
          │  │  - Cherry-pick to test        │  │
          │  ├───────────────────────────────┤  │
          │  │ Mode: docker                  │  │
          │  │  - Isolated containers        │  │
          │  │  - Battle network             │  │
          │  ├───────────────────────────────┤  │
          │  │ Mode: qemu                    │  │
          │  │  - Emulated firmware          │  │
          │  │  - GDB attach points          │  │
          │  └───────────────────────────────┘  │
          └─────────────────────────────────────┘
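The "Concurrent Red/Blue execution" idea in the diagram can be illustrated with a minimal Python sketch. It assumes one worker per team per round and that the orchestrator waits for both before moving on. The names `red_turn`, `blue_turn`, `run_round`, and `run_battle` are hypothetical placeholders, not the skill's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def red_turn(round_no: int) -> dict:
    # Placeholder for the Red Team's attack step (hypothetical).
    return {"team": "red", "round": round_no, "findings": []}


def blue_turn(round_no: int) -> dict:
    # Placeholder for the Blue Team's defense step (hypothetical).
    return {"team": "blue", "round": round_no, "patches": []}


def run_round(round_no: int) -> dict:
    """Run both teams concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        red_future = pool.submit(red_turn, round_no)
        blue_future = pool.submit(blue_turn, round_no)
        return {"red": red_future.result(), "blue": blue_future.result()}


def run_battle(max_rounds: int) -> list:
    """Play rounds in order; each round is a synchronization point."""
    return [run_round(round_no) for round_no in range(1, max_rounds + 1)]
```

Waiting on both futures at the end of each round gives the orchestrator a natural point to score and checkpoint before the next round starts.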
Digital Twin Modes
The battle skill supports multiple target types through its Digital Twin system:
1. Source Code (git_worktree)
For battling over git repositories. Creates isolated git worktrees for each team.
./run.sh battle /path/to/repo --rounds 100
2. Docker Container (docker)
For battling over containerized applications. Spins up separate containers for each team.
# Using a Docker image
./run.sh battle --docker-image nginx:latest --rounds 100
# Using a Dockerfile in the target directory
./run.sh battle /path/with/Dockerfile --mode docker
3. Firmware/Microprocessor (qemu)
For battling over firmware and embedded systems. Boots firmware in QEMU emulator.
# Auto-detect architecture from ELF header
./run.sh battle firmware.elf --rounds 100
# Specify machine type explicitly
./run.sh battle firmware.bin --qemu-machine arm
./run.sh battle firmware.bin --qemu-machine riscv64
./run.sh battle bios.rom --qemu-machine x86_64
Supported QEMU machines:
- arm - ARM Cortex-M (STM32, etc.)
- aarch64 - ARM64
- riscv32/riscv64 - RISC-V
- x86_64/i386 - x86
- mips - MIPS (routers, embedded)
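Auto-detection from an ELF header could look roughly like the Python sketch below. The byte offsets and `e_machine` values come from the ELF specification. The function name `detect_qemu_machine` and the mapping onto `--qemu-machine` names are assumptions for illustration, not the skill's actual implementation.

```python
import struct
from typing import Optional

# e_machine values defined by the ELF specification.
EM_TO_ARCH = {
    3: "i386",
    8: "mips",
    40: "arm",
    62: "x86_64",
    183: "aarch64",
    243: "riscv",
}


def detect_qemu_machine(header: bytes) -> Optional[str]:
    """Map the first 20 bytes of a file to a machine name, or None if unknown."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    ei_class = header[4]  # 1 = 32-bit, 2 = 64-bit
    ei_data = header[5]   # 1 = little-endian, 2 = big-endian
    if ei_data not in (1, 2):
        return None
    byte_order = "<" if ei_data == 1 else ">"
    # e_machine is a 16-bit field at offset 18, after e_ident and e_type.
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    arch = EM_TO_ARCH.get(e_machine)
    if arch == "riscv":
        return "riscv64" if ei_class == 2 else "riscv32"
    return arch
```

Raw images such as `firmware.bin` carry no ELF header, so a function like this returns `None` for them, which is consistent with the examples above passing `--qemu-machine` explicitly for `.bin` and `.rom` files.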
4. Copy Mode (fallback)
For non-git directories. Creates simple file copies for each team.
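How a target maps to one of the four modes is not spelled out here. The Python sketch below shows one plausible selection order, inferred only from the examples above. The function `choose_mode`, its parameters, and the firmware suffix list are assumptions, not the skill's documented behavior.

```python
from pathlib import Path
from typing import Optional

# Assumption: suffixes taken from the firmware examples in this document.
FIRMWARE_SUFFIXES = {".elf", ".bin", ".rom"}


def choose_mode(
    target: Optional[str] = None,
    docker_image: Optional[str] = None,
    mode: Optional[str] = None,
) -> str:
    """Pick a Digital Twin mode; an explicit mode always wins."""
    if mode:
        return mode
    if docker_image:
        return "docker"
    if target is None:
        raise ValueError("a target path or a Docker image is required")
    path = Path(target)
    if path.is_file() and path.suffix.lower() in FIRMWARE_SUFFIXES:
        return "qemu"
    # .git is a directory in a normal clone and a file in a linked worktree.
    if (path / ".git").exists():
        return "git_worktree"
    return "copy"  # fallback for non-git directories
```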
Commands
# Start a battle (10 rounds for testing)
./run.sh battle /path/to/codebase --rounds 10
# Start overnight battle (1000 rounds)
./run.sh battle /path/to/codebase --overnight
# Battle a Docker container
./run.sh battle --docker-image myapp:latest --rounds 100
# Battle firmware with QEMU
./run.sh battle firmware.bin --qemu-machine arm --rounds 100
# Check battle status
./run.sh status
# Resume interrupted battle
./run.sh resume <battle-id>
# Generate report from completed battle
./run.sh report <battle-id>
Scoring System (AIxCC-style)
| Metric | Weight | Description |
|---|---|---|
| Vulnerability Discovery | 1x | Red team finds vulnerability |
| Exploit Proof | +0.5x | Red team proves exploitability |
| Successful Patch | 3x | Blue team patches vulnerability |
| Time Decay | Variable | Faster responses score higher |
| Functionality Preserved | Required | Patches must not break code |
Scores
- TDSR (True Defense Success Rate): Vulnerabilities fixed AND code works
- FDSR (Fake Defense Success Rate): Attack blocked but code broken
- ASC (Attack Success Count): Total unique exploits discovered
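The weights and rates above can be made concrete with a short Python sketch. It is a simplified reading of the table: time decay is left out because its formula is not given, and the rates use the number of patch attempts as the denominator, which is an assumption. All class and function names are hypothetical.

```python
from dataclasses import dataclass

DISCOVERY_WEIGHT = 1.0  # Vulnerability Discovery: 1x
EXPLOIT_BONUS = 0.5     # Exploit Proof: +0.5x
PATCH_WEIGHT = 3.0      # Successful Patch: 3x


@dataclass
class Finding:
    exploit_proven: bool = False


@dataclass
class PatchAttempt:
    attack_blocked: bool
    tests_pass: bool  # "Functionality Preserved" requirement


def red_score(findings: list) -> float:
    return sum(
        DISCOVERY_WEIGHT + (EXPLOIT_BONUS if f.exploit_proven else 0.0)
        for f in findings
    )


def blue_score(patches: list) -> float:
    # A patch only scores if it blocks the attack and keeps the code working.
    return sum(PATCH_WEIGHT for p in patches if p.attack_blocked and p.tests_pass)


def defense_rates(patches: list) -> tuple:
    """Return (TDSR, FDSR) over all patch attempts (assumed denominator)."""
    if not patches:
        return 0.0, 0.0
    true_defenses = sum(p.attack_blocked and p.tests_pass for p in patches)
    fake_defenses = sum(p.attack_blocked and not p.tests_pass for p in patches)
    return true_defenses / len(patches), fake_defenses / len(patches)
```

Under these assumptions, three unproven findings score 3 points for Red and three working patches score 9 points for Blue.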
Game Loop (Learning-Based)
Each round follows a learn → act → reflect pattern:
Round k:
┌─────────────────────────────────────────────────────────────┐
│ 1. RESEARCH PHASE                                           │
├─────────────────────────────────────────────────────────────┤
│ Red Team:                     Blue Team:                    │
│ - Recall past attack attempts - Recall past defenses        │
│ - Query /dogpile for new      - Query /dogpile for          │
│   exploitation techniques       hardening strategies        │
│ - Review opponent's patterns  - Analyze attack evolution    │
│ (Budget: 3 research calls max)                              │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│ 2. ACTION PHASE                                             │
├─────────────────────────────────────────────────────────────┤
│ Red Team Attack:              Blue Team Defense:            │
│ - Execute learned strategy    - Apply patches via anvil     │
│ - AFL++ fuzzing with coverage - Verify via QCOW2 overlay    │
│ - Collect crashes/findings    - Run regression tests        │
│ - Tag findings with /taxonomy - Tag patches with /taxonomy  │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│ 3. REFLECTION PHASE                                         │
├─────────────────────────────────────────────────────────────┤
│ Both Teams:                                                 │
│ - Archive round episode (actions, outcomes, learnings)      │
│ - Store successful strategies in /memory                    │
│ - Update belief about opponent's capabilities               │
│ - Evolve strategy for next round                            │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│ 4. SCORING & CHECKPOINT                                     │
├─────────────────────────────────────────────────────────────┤
│ - Calculate AIxCC-style scores                              │
│ - Check termination conditions                              │
│ - Save checkpoint (QEMU state + team memories)              │
└─────────────────────────────────────────────────────────────┘
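The phase order and the research budget can be sketched in Python as follows. The sketch assumes the budget of 3 calls applies per team per round, which the diagram does not state explicitly. `ResearchBudget`, `research_phase`, and `play_round` are hypothetical names, and the action and reflection steps are placeholders.

```python
class ResearchBudget:
    """Counts research calls and refuses any beyond the limit."""

    def __init__(self, max_calls: int = 3):
        self.max_calls = max_calls
        self.used = 0

    def try_spend(self) -> bool:
        if self.used >= self.max_calls:
            return False
        self.used += 1
        return True


def research_phase(queries: list, budget: ResearchBudget) -> list:
    # Keep queries in order until the budget runs out; the rest are dropped.
    return [q for q in queries if budget.try_spend()]


def play_round(round_no: int, queries: list) -> dict:
    """One team's round: research, then act, then reflect."""
    budget = ResearchBudget(max_calls=3)
    researched = research_phase(queries, budget)        # 1. research
    action = {"inputs": researched, "outcome": None}    # 2. action (placeholder)
    episode = {                                         # 3. reflection: archive
        "round": round_no,
        "research": researched,
        "action": action,
    }
    return episode
```

Calling `play_round(1, ["q1", "q2", "q3", "q4"])` keeps only the first three queries, so a team has to prioritize what it researches each round.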
Memory Architecture
Each team maintains isolated knowledge:
battle_red_<battle_id>/          battle_blue_<battle_id>/
├── strategies/                  ├── strategies/
│   ├── successful_attacks       │   ├── successful_patches
│   └── failed_attempts          │   └── broken_defenses
├── research/                    ├── research/
│   └── dogpile_results          │   └── dogpile_results
├── episodes/                    ├── episodes/
│   ├── round_001.json           │   ├── round_001.json
│   └── round_002.json           │   └── round_002.json
└── taxonomy/                    └── taxonomy/
    ├── cwe_classifications          ├── mitigation_types
    └── severity_scores              └── effectiveness_scores
Teams cannot access each other's memory, which keeps the learning adversarial: each side has to infer the opponent's strategy from observed outcomes.
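One way to enforce this isolation is to resolve every memory path against the team's own root and reject anything that escapes it. The Python sketch below assumes the `battle_<team>_<battle_id>` naming shown above is a directory layout. The `TeamMemory` class is hypothetical, and the skill may implement isolation differently, for example through the /memory skill's own scoping.

```python
from pathlib import Path


class TeamMemory:
    """Resolves paths inside one team's memory root and nothing else."""

    def __init__(self, root: Path, team: str, battle_id: str):
        self.base = (root / f"battle_{team}_{battle_id}").resolve()

    def path_for(self, relative: str) -> Path:
        candidate = (self.base / relative).resolve()
        try:
            # relative_to raises ValueError when candidate is outside base.
            candidate.relative_to(self.base)
        except ValueError:
            raise PermissionError(f"{relative!r} is outside this team's memory")
        return candidate
```

With this check, a Red Team request for `../battle_blue_<battle_id>/strategies/successful_patches` raises `PermissionError` instead of returning a path.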
Termination Conditions
Battle ends when ANY condition is met:
- Null Production: Both teams fail to generate new findings for 3 rounds
- Maximum Rounds: Configured limit reached
- Metric Convergence: Scores stable for 5 consecutive rounds
- Kill Switch: Manual termination via ./run.sh stop
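The four conditions can be combined into a single check, sketched below in Python. The sketch assumes each round is recorded as a dict with the hypothetical keys `red_new`, `blue_new`, `red_score`, and `blue_score`, that the 3 null rounds are consecutive, and that "stable" means the scores did not change at all. A real implementation might use a tolerance instead.

```python
from typing import Optional


def should_terminate(
    history: list,
    max_rounds: int,
    null_window: int = 3,
    stable_window: int = 5,
    kill_switch: bool = False,
) -> Optional[str]:
    """Return the name of the first condition that is met, or None."""
    if kill_switch:
        return "kill_switch"
    if len(history) >= max_rounds:
        return "max_rounds"
    recent = history[-null_window:]
    if len(recent) == null_window and all(
        r["red_new"] == 0 and r["blue_new"] == 0 for r in recent
    ):
        return "null_production"
    recent = history[-stable_window:]
    if len(recent) == stable_window:
        scores = {(r["red_score"], r["blue_score"]) for r in recent}
        if len(scores) == 1:  # identical scores across the whole window
            return "metric_convergence"
    return None
```

Returning the reason rather than a bare boolean lets a report state why the battle ended.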
Task Monitor Integration
Battles register with task-monitor for overnight progress tracking:
# View battle progress in TUI
.pi/skills/task-monitor/run.sh tui --filter battle
Report Output
After battle completion, generates:
- Executive Summary: Winner, key metrics, risk score
- Vulnerability Report: By severity, category, remediation status
- Attack Evolution: How Red team adapted over rounds
- Defense Timeline: Blue team improvements over time
- Recommendations: Prioritized security improvements
Leveraged Skills
| Skill | Team | Purpose |
|---|---|---|
| hack | Red | Scanning, auditing, exploitation |
| anvil | Blue | Multi-agent patching (Thunderdome) |
| memory | Both | Recall prior strategies |
| treesitter | Blue | Code structure analysis |
| taxonomy | Both | Classify findings |
| task-monitor | Orchestrator | Progress tracking |
| docker-ops | Both | Container management |
Example Battle
# Start 100-round battle on current project
./run.sh battle --target . --rounds 100
# Output:
# Battle ID: battle_20250128_221500
# Target: /home/user/project
# Rounds: 100
#
# Registering with task-monitor...
# Starting Round 1/100...
# [Red] Scanning target with hack...
# [Red] Found 3 potential vulnerabilities
# [Blue] Analyzing attack logs...
# [Blue] Generating patch for SQL injection...
# [Blue] Patch applied, running verification...
# Round 1 complete. Red: 3 pts, Blue: 9 pts
# ...
#
# Battle Complete!
# Winner: Blue Team (847 pts vs 423 pts)
# Report: ./reports/battle_20250128_221500.md