Use when you have a written implementation plan to execute in a separate session with review checkpoints.

Install this specific skill from the multi-skill repository:

npx skills add grahama1970/agent-skills --skill "orchestrate"
# SKILL.md
---
name: orchestrate
description: >
  Orchestrate tasks from a 0N_TASKS.md file with enforced memory-first pre-hooks,
  quality-gate post-hooks, and session archiving. BLOCKS if unresolved questions exist.
  Use when user says "run the tasks", "execute the task file", or "orchestrate".
allowed-tools: Bash, Read, orchestrate
triggers:
  - orchestrate
  - run the tasks
  - execute the task file
  - run 0N_TASKS
  - execute tasks.md
  - run each task
  - start the task list
metadata:
  short-description: Execute task list with enforced hooks (memory recall + quality gate)
---
Task Orchestration Skill
Execute tasks from a collaborative task file (e.g., 0N_TASKS.md) with enforced hooks:
- Questions/Blockers Gate: BLOCKS execution if unresolved questions exist
- Memory-first Pre-hook: Queries memory BEFORE each task (not optional)
- Quality-gate Post-hook: Runs tests AFTER each task (must pass)
- Session Archiving: Stores completed session for future recall
⚠️ Non-Negotiable: Sanity Scripts + Completion Tests
Without these, the orchestrator WILL hallucinate and errors WILL compound.
LLMs cannot reliably verify their own work. Without external validation:
- Agent "completes" Task 1 with subtle bug
- Task 2 builds on broken Task 1
- Task 3 compounds the errors
- By Task 5, the codebase is corrupted beyond repair
The ONLY defense: Working sanity scripts + completion tests that are DIVORCED from project complexity.
Every task file MUST include (via human-agent collaboration):
| Requirement | Purpose | When Created | Example |
|---|---|---|---|
| Sanity Script | Proves dependencies/APIs work IN ISOLATION | BEFORE implementation | sanity/camelot.py - extracts a table from a simple test PDF |
| Completion Test | Proves task succeeded with CONCRETE assertion | BEFORE implementation | test_table_extractor.py::test_extracts_3_tables |
Which Packages Need Sanity Scripts?
Only create sanity scripts for packages where the agent might hallucinate usage:
| Needs Sanity Script | Examples | Why |
|---|---|---|
| ✅ Little-known packages | camelot, pdfplumber, surya | Agent may not know correct API |
| ✅ Complex APIs | transformers, opencv, paddleocr | Many parameters, easy to get wrong |
| ✅ User/project-generated code | myproject.utils, custom modules | Not in training data |
| ❌ Standard library | json, os, pathlib, typing | Well-documented, agent knows these |
| ❌ Well-known packages | requests, numpy, pandas | Widely used, agent reliable here |
Why "Divorced from Project Complexity"?
Sanity scripts must test the CORE FUNCTIONALITY in isolation:
# GOOD: Tests Camelot API works (little-known package)
# sanity/camelot_tables.py
import camelot
tables = camelot.read_pdf("fixtures/simple_table.pdf", flavor="lattice")
assert len(tables) > 0, "Camelot failed to extract any tables"
print(f"PASS: Extracted {len(tables)} tables, accuracy={tables[0].parsing_report['accuracy']}")
# BAD: Testing json.loads (standard library - agent knows this)
# sanity/json_parsing.py
import json
data = json.loads('{"key": "value"}') # Pointless - agent won't hallucinate this
# BAD: Tests your whole pipeline, hides where failure occurs
# sanity/camelot_tables.py
from myproject.pipeline import extract_tables # Too coupled!
result = extract_tables("complex_document.pdf") # Too complex!
Rule of thumb: If you'd trust a junior developer to use the API correctly from memory, skip the sanity script. If YOU had to look up the docs, create one.
The task file is INCOMPLETE without both. Do not proceed to implementation until:
- Sanity scripts pass (dependencies verified IN ISOLATION)
- Completion tests are defined (Definition of Done with CONCRETE assertions)
Run pre-flight check: ./preflight.sh 01_TASKS.md
This is collaborative work - agent proposes, human verifies/refines.
The Collaborative Workflow
flowchart TB
subgraph Phase1["PHASE 1: Collaborate on Task File"]
H1[Human: I need to refactor auth] --> A1[Agent creates 0N_TASKS.md]
A1 --> Q1["## Questions/Blockers<br/>- Which auth method?<br/>- Backwards compat?"]
Q1 --> H2[Human answers questions]
H2 --> S1["Create SANITY SCRIPTS<br/>for non-standard deps"]
S1 --> S2{Sanity<br/>scripts pass?}
S2 -->|No| S3[Fix deps/scripts<br/>with human]
S3 --> S1
S2 -->|Yes| T1["Define COMPLETION TESTS<br/>for each task"]
T1 --> T2[Human approves<br/>Definition of Done]
T2 --> Q2["Questions resolved → None<br/>Sanity: PASS<br/>Tests: DEFINED"]
end
subgraph Phase2["PHASE 2: Execute via orchestrate tool"]
O1["orchestrate({ taskFile })"] --> Check{Blockers<br/>exist?}
Check -->|Yes| Block[BLOCKED - Resolve first]
Check -->|No| SanityCheck{Sanity<br/>scripts pass?}
SanityCheck -->|No| Block2[BLOCKED - Fix sanity first]
SanityCheck -->|Yes| Loop["For each task:"]
Loop --> Pre["PRE-HOOK: memory recall"]
Pre --> Exec["Execute: pi --no-session"]
Exec --> Post["POST-HOOK: Run Definition of Done test"]
Post --> Pass{Test<br/>passes?}
Pass -->|Yes| Update["Update: [ ] → [x]"]
Pass -->|No| Retry{Attempts < 3?}
Retry -->|Yes| Fix[Analyze, fix, retry]
Fix --> Exec
Retry -->|No| Fail[HARD FAIL - Human needed]
Update --> Next{More<br/>tasks?}
Next -->|Yes| Loop
Next -->|No| Done[All complete]
end
subgraph Phase3["PHASE 3: Archive"]
Archive["episodic-archiver<br/>→ Future recall enabled"]
end
Phase1 --> Phase2
Done --> Phase3
Human-Agent Collaboration: Creating Sanity Scripts + Tests
Before ANY implementation begins, agent and human collaborate to create:
Step 1: Identify Dependencies → Create Sanity Scripts
Agent: "Task 2 requires Camelot for table extraction. I'll create a sanity script."
Agent: [Creates sanity/camelot_tables.py with working example]
Agent: "Please verify this works in your environment: python sanity/camelot_tables.py"
Human: [Runs script] "It fails - ghostscript not installed"
Agent: [Updates script with installation check and clearer error]
Human: [Runs again] "PASS - extracts 3 tables correctly"
Agent: [Marks sanity script as verified in task file]
Step 2: Define Completion Tests → Definition of Done
Agent: "For Task 2 (extract tables from PDF), what proves success?"
Human: "It should find all 5 tables in the test PDF and preserve headers"
Agent: "I'll define the test:
- Test: tests/test_table_extractor.py::test_extracts_all_tables
- Assertion: Extracts exactly 5 tables from fixtures/sample.pdf with headers intact"
Human: "Add a test for empty PDFs too"
Agent: [Updates Definition of Done with both tests]
Step 3: Write Failing Tests FIRST
Agent: [Creates test file with tests that will FAIL]
Agent: "Tests are written but failing (as expected). Ready to implement?"
Human: "Yes, proceed"
Agent: [Implements feature, runs tests, they pass]
Key principle: No surprises. Human knows exactly what "done" means before coding starts.
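As a concrete illustration, here is a minimal sketch of the failing-first tests agreed in Step 2. The `myproject.table_extractor` module, its return shape, and the `fixtures/empty.pdf` fixture are assumptions for illustration only:

```python
# tests/test_table_extractor.py - written BEFORE implementation, expected to FAIL at first
from myproject.table_extractor import extract_tables  # hypothetical module, not yet implemented


def test_extracts_all_tables():
    tables = extract_tables("fixtures/sample.pdf")
    assert len(tables) == 5, "should find all 5 tables in the test PDF"
    assert all(table.headers for table in tables), "headers must be preserved"


def test_empty_pdf_returns_no_tables():
    tables = extract_tables("fixtures/empty.pdf")
    assert tables == []
```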
Pre-Flight Checklist (Runs Before Any Task Executes)
Before executing ANY task, the orchestrator runs a pre-flight check:
flowchart TB
Start[orchestrate called] --> PF["PRE-FLIGHT CHECK"]
PF --> Q{Questions/<br/>Blockers?}
Q -->|Yes| Block1[❌ BLOCKED: Resolve questions first]
Q -->|No| S{Sanity scripts<br/>exist?}
S -->|Missing| Block2[❌ BLOCKED: Create sanity scripts first]
S -->|Exist| SP{Sanity scripts<br/>PASS?}
SP -->|Fail| Block3[❌ BLOCKED: Fix sanity scripts first]
SP -->|Pass| T{Definition of Done<br/>defined for all tasks?}
T -->|Missing| Block4[❌ BLOCKED: Define completion tests first]
T -->|Defined| TF{Test files<br/>exist?}
TF -->|Missing| Block5[❌ BLOCKED: Create test files first]
TF -->|Exist| Ready[✅ PRE-FLIGHT PASS<br/>Begin execution]
Pre-Flight Checklist Items
| Check | What It Validates | Failure Action |
|---|---|---|
| 1. Questions/Blockers | No unresolved items in section | BLOCK - collaborate to resolve |
| 2. Sanity Scripts Exist | Each dependency in table has a script file | BLOCK - create scripts with human |
| 3. Sanity Scripts Pass | `python sanity/*.py` all exit 0 | BLOCK - fix deps/scripts with human |
| 4. Definition of Done Defined | Each implementation task has Test + Assertion | BLOCK - define tests with human |
| 5. Test Files Exist | Referenced test files actually exist | BLOCK - create test files first |
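The real gate is the `./preflight.sh` script; the sketch below re-expresses the same five checks in Python purely for illustration. The regexes and file-path conventions are assumptions about how the task file is written:

```python
# Illustrative pre-flight sketch - the actual check is ./preflight.sh
import re
import subprocess
import sys
from pathlib import Path


def preflight(task_file: str) -> bool:
    text = Path(task_file).read_text()

    # 1. Questions/Blockers section must be resolved ("None" / "N/A")
    section = re.search(r"## Questions/Blockers\n(.*?)(?=\n## |\Z)", text, re.S)
    if section and not re.match(r"(None|N/A)", section.group(1).strip()):
        print("BLOCKED: unresolved Questions/Blockers")
        return False

    # 2 & 3. Every referenced sanity script must exist and exit 0
    for script in set(re.findall(r"sanity/[\w.-]+\.py", text)):
        if not Path(script).exists():
            print(f"BLOCKED: missing sanity script {script}")
            return False
        if subprocess.run([sys.executable, script]).returncode != 0:
            print(f"BLOCKED: sanity script failed: {script}")
            return False

    # 4 & 5. Every referenced Definition of Done test file must exist
    for test in set(re.findall(r"tests/[\w./-]+\.py", text)):
        if not Path(test).exists():
            print(f"BLOCKED: missing test file {test}")
            return False

    print("PRE-FLIGHT PASS")
    return True
```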
Pre-Flight Output
=== PRE-FLIGHT CHECK: 01_TASKS.md ===
[1/5] Questions/Blockers... ✅ None
[2/5] Sanity scripts exist...
- sanity/camelot_tables.py ✅
- sanity/pdfplumber_tables.py ✅
[3/5] Sanity scripts pass...
- sanity/camelot_tables.py ✅ (exit 0)
- sanity/pdfplumber_tables.py ❌ (exit 1: ghostscript missing)
❌ PRE-FLIGHT FAILED: Sanity script failed
Fix: Install ghostscript or update sanity script
Cannot proceed until all sanity scripts pass.
Why Pre-Flight Matters
Without pre-flight:
- Task 1 executes successfully
- Task 2 starts, needs Camelot
- Camelot fails (ghostscript missing)
- Task 2 fails, error compounds
- Task 3 depends on Task 2, also fails
- Wasted effort, corrupted state
With pre-flight:
- Check sanity scripts BEFORE any execution
- Camelot sanity fails immediately
- Human fixes ghostscript
- Re-run pre-flight, all pass
- NOW execute tasks with confidence
Pre-flight is cheap. Failed tasks are expensive.
Critical: Questions/Blockers Section
The orchestrator BLOCKS execution if unresolved questions exist:
## Questions/Blockers
- Which database should we use? (blocks Task 3)
- Do we need backwards compatibility?
To proceed: Answer the questions and either:
- Remove the items
- Change to "None" or "N/A"
This forces collaborative clarification BEFORE coding starts.
Sanity-First Collaboration (Crucial Dependencies)
NEW: For non-standard APIs, create sanity scripts BEFORE marking Questions/Blockers as resolved.
When a task requires libraries/APIs beyond standard ones (json, pathlib, typing, etc.), the agent must:
Phase 1a: Dependency Identification
flowchart LR
A[Task identified] --> B{Non-standard APIs?}
B -->|No| C[Skip to Questions]
B -->|Yes| D[Research with skills]
D --> E[Create sanity script]
E --> F[Human verifies]
F --> G[Then resolve Questions]
Research Skill Priority
- brave-search (free) - General patterns, StackOverflow, blog posts
- Context7 (free) - Library-specific documentation chunks
- perplexity (paid) - Complex research, comparisons (use sparingly)
Sanity Script Requirements
Each non-standard dependency gets a script in tools/tasks_loop/sanity/:
# sanity/{library}.py - Agent REFERENCES this when implementing
"""
PURPOSE: Working example with correct parameters
DOCUMENTATION: Context7 query used, last verified date
"""
# Must show: imports, parameters with values, expected output, edge cases
# Exit codes: 0=PASS, 1=FAIL, 42=CLARIFY (needs human)
Example: Before using Camelot for table extraction, create sanity/camelot_table_extraction.py that shows:
- Both `lattice` and `stream` modes
- `line_scale`, `edge_tol`, `row_tol` parameters with valid values
- How to check accuracy scores
- Known issues (ghostscript dependency, etc.)
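A minimal sketch of what such a script could look like, following the exit-code convention above (the fixture path and the accuracy threshold are illustrative choices, not requirements):

```python
# sanity/camelot_table_extraction.py - illustrative sketch
"""
PURPOSE: Prove Camelot can extract a table from a known-simple PDF in isolation.
DOCUMENTATION: <Context7 query used>, <last verified date>
"""
import sys

try:
    import camelot
except ImportError:
    print("FAIL: camelot is not installed")
    sys.exit(1)


def main() -> int:
    # lattice mode requires ghostscript; a missing binary surfaces here, not mid-pipeline
    tables = camelot.read_pdf("fixtures/simple_table.pdf", flavor="lattice")
    if len(tables) == 0:
        print("FAIL: no tables extracted")
        return 1
    accuracy = tables[0].parsing_report["accuracy"]
    if accuracy < 80:
        print(f"CLARIFY: low accuracy ({accuracy}) - is the fixture PDF suitable?")
        return 42  # needs human judgment
    print(f"PASS: {len(tables)} tables, accuracy={accuracy}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```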
Task File with Dependencies
## Crucial Dependencies
| Library | API/Method | Sanity Script | Status |
| ---------- | ------------------ | ---------------------- | ------------ |
| camelot | `read_pdf()` | `sanity/camelot.py` | [x] verified |
| pdfplumber | `extract_tables()` | `sanity/pdfplumber.py` | [ ] pending |
## Questions/Blockers
- [ ] All sanity scripts must pass before this resolves to "None"
Why This Matters
- Agent learns from working examples - Not just "use camelot" but exactly how
- Parameters are documented - `line_scale=40` with an explanation of why
- Edge cases captured - "Needs ghostscript installed"
- Human verification - Confirms the script actually works
- Future agents benefit - Sanity scripts persist for recall
Task File Format: 0N_TASKS.md
# Task List: <Project/Feature Name>
## Context
<Brief description of what we're trying to accomplish>
## Crucial Dependencies (Sanity Scripts)
| Library | API/Method | Sanity Script | Status |
| ---------- | ------------------ | ----------------------------- | ----------- |
| camelot | `read_pdf()` | `sanity/camelot_tables.py` | [x] PASS |
| pdfplumber | `extract_tables()` | `sanity/pdfplumber_tables.py` | [ ] PENDING |
> ⚠️ All sanity scripts must PASS before proceeding to implementation.
## Tasks
- [ ] **Task 1**: <Clear, actionable description>
- Agent: general-purpose
- Parallel: 0
- Dependencies: none
- Notes: <any context>
- **Sanity**: `sanity/camelot_tables.py` (must pass first)
- **Definition of Done**:
- Test: `tests/test_feature.py::test_task1_behavior`
- Assertion: <what the test proves>
- [ ] **Task 2**: <Description>
- Agent: general-purpose
- Parallel: 1
- Dependencies: Task 1
- Notes: <context>
- **Sanity**: None (uses json, pathlib, requests - well-known APIs)
- **Definition of Done**:
- Test: `tests/test_feature.py::test_task2_behavior`
- Assertion: <what the test proves>
- [ ] **Task 3**: <Description>
- Agent: explore
- Parallel: 1
- Dependencies: none
- **Sanity**: N/A (research only)
- **Definition of Done**: N/A (research only, no code changes)
## Completion Criteria
<How do we know we're done?>
## Questions/Blockers
None - all questions resolved, all sanity scripts pass.
The "Definition of Done" Field
Every implementation task MUST have a Definition of Done that specifies:
- Test file/function: The exact test that verifies this task
- Assertion: What the test proves (in plain English)
Examples:
- **Definition of Done**:
- Test: `tests/core/providers/test_image.py::test_vlm_fallback_to_ocr`
- Assertion: When VLM returns garbage (<100 chars), OCR fallback is triggered
- **Definition of Done**:
- Test: `tests/api/test_auth.py::test_refresh_token_expired`
- Assertion: Expired refresh tokens return 401 and clear session
- **Definition of Done**: N/A (documentation only)
If no test exists, the task file should note this:
- **Definition of Done**:
- Test: MISSING - must be created before implementation
- Assertion: <describe what we need to verify>
This forces collaborative discussion about what "done" means BEFORE coding starts.
Test = Gate Enforcement:
When you specify a test file in the Test: field, the orchestrator AUTOMATICALLY enables "Retry Until Pass" mode.
- It treats the test file as a Quality Gate.
- The agent will be forced to Loop (analyze -> fix -> retry) up to 3 times (default) until that specific test passes.
- This prevents "hallucinated completion" where the agent says "I fixed it" but didn't run the test.
The orchestrate Tool
Basic Usage
orchestrate({
taskFile: "01_TASKS.md", // Path to task file
continueOnError: false, // Stop on first failure (default)
archive: true, // Archive on completion (default)
taskTimeoutMs: 1800000, // 30 min per task (default)
});
What Happens
- PRE-FLIGHT CHECK (MANDATORY - runs `./preflight.sh`):
  - ❌ Questions/Blockers exist? → BLOCKED
  - ❌ Sanity scripts missing? → BLOCKED
  - ❌ Sanity scripts fail? → BLOCKED
  - ❌ Definition of Done missing? → BLOCKED
  - ❌ Test files missing? → BLOCKED
  - ✅ All checks pass → Proceed to execution
- For Each Task:
  - PRE-HOOK: `~/.pi/agent/skills/memory/run.sh recall --q "<task>"`
    - If solutions found → injected as context in task prompt
    - Agent decides how to use prior knowledge
  - EXECUTE: `pi --mode json -p --no-session "<task prompt>"`
    - Protected context, no session bleed
    - INSTRUCTION: "Run the Definition of Done test to verify before finishing"
    - Agent config provides system prompt
  - POST-HOOK: Run the specific Definition of Done test for this task
    - NOT the whole test suite - just the task's specific test
    - Task FAILS if test doesn't pass
    - Retry up to 3 times before hard failure
- UPDATE: Mark checkbox `[x]` in task file
- Archive: Store session via episodic-archiver
If pre-flight fails, orchestrator REFUSES to execute. Fix issues first.
Memory Recall Context
When memory finds prior solutions, they're injected into the task prompt:
## Memory Recall (Prior Solutions Found)
The following relevant solutions were found in memory. Review and adapt as needed:
1. **Problem**: OAuth token refresh failing silently
**Solution**: Add explicit error handling in refreshToken(), log failures
---
## Context
...rest of task prompt...
The agent sees this context and decides whether to apply, adapt, or ignore it.
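A rough sketch of that pre-hook, assuming the recall command simply prints any matches to stdout (the real injection format may differ):

```python
# Illustrative memory-first pre-hook (output format of run.sh is assumed)
import subprocess
from pathlib import Path

RECALL_SCRIPT = Path.home() / ".pi/agent/skills/memory/run.sh"


def build_task_prompt(task_description: str) -> str:
    recall = subprocess.run(
        ["bash", str(RECALL_SCRIPT), "recall", "--q", task_description],
        capture_output=True,
        text=True,
    )
    prompt = ""
    if recall.returncode == 0 and recall.stdout.strip():
        prompt += (
            "## Memory Recall (Prior Solutions Found)\n"
            "The following relevant solutions were found in memory. "
            "Review and adapt as needed:\n\n"
            f"{recall.stdout.strip()}\n\n---\n\n"
        )
    prompt += f"## Context\n{task_description}"
    return prompt
```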
Quality Gate Enforcement
After each task, quality-gate.sh runs:
# Auto-detects project type and runs:
# - Python: pytest -q -x
# - Node: npm test
# - Go: go test ./...
# - Rust: cargo check
# - Makefile: make test (or make smokes)
If tests fail:
- Task status = `failed`
- Error output included in results
- Orchestration stops (unless `continueOnError: true`)
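The detection itself lives in quality-gate.sh; a Python sketch of an equivalent mapping (the marker files used here are assumptions) would be:

```python
# Illustrative project-type detection (the real logic lives in quality-gate.sh)
from pathlib import Path


def detect_test_command(project_dir: str = ".") -> list[str]:
    root = Path(project_dir)
    if (root / "pyproject.toml").exists() or (root / "pytest.ini").exists():
        return ["pytest", "-q", "-x"]          # Python
    if (root / "package.json").exists():
        return ["npm", "test"]                 # Node
    if (root / "go.mod").exists():
        return ["go", "test", "./..."]         # Go
    if (root / "Cargo.toml").exists():
        return ["cargo", "check"]              # Rust
    if (root / "Makefile").exists():
        return ["make", "test"]                # Makefile project
    raise RuntimeError("Unknown project type - no quality gate available")
```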
When to Use
| Trigger | Action |
|---|---|
| "Let's plan this" | Collaborate on task file (don't run yet) |
| "Run the tasks" | Execute via orchestrate tool |
| "Orchestrate 01_TASKS.md" | Execute specific file |
| "Schedule nightly" | Schedule via orchestrate schedule |
| Unresolved questions | BLOCKED - clarify first |
Parallel Task Execution
Tasks can run in parallel groups using the Parallel field:
- [ ] **Task 1**: Setup database
- Parallel: 0 # Group 0 runs FIRST (sequentially before any parallel tasks)
- [ ] **Task 2**: Create API endpoints
- Parallel: 1 # Group 1 tasks run IN PARALLEL after Group 0 completes
- Dependencies: Task 1
- [ ] **Task 3**: Create frontend components
- Parallel: 1 # Also Group 1 - runs CONCURRENTLY with Task 2
- Dependencies: Task 1
- [ ] **Task 4**: Integration tests
- Parallel: 2 # Group 2 runs after ALL Group 1 tasks complete
- Dependencies: Task 2, Task 3
Execution Order:
1. All Parallel: 0 tasks run sequentially (respecting dependencies)
2. All Parallel: 1 tasks run concurrently (after their dependencies are met)
3. All Parallel: 2 tasks run concurrently (after Group 1 completes)
4. And so on...
Rules:
- Tasks in the same group with unmet dependencies wait until dependencies complete
- Lower parallel numbers run before higher numbers
- Default is Parallel: 0 (runs first, sequential)
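The scheduling rule can be summarized in a short sketch (the data shapes and executor choice are illustrative; same-group dependency waiting is omitted for brevity):

```python
# Illustrative group scheduling - not the orchestrator's actual implementation
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_in_groups(tasks, execute):
    """tasks: dicts with 'name', 'parallel' (int), and 'dependencies' (list of names)."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.get("parallel", 0)].append(task)

    completed = set()
    for group_id in sorted(groups):             # lower parallel numbers run first
        runnable = [t for t in groups[group_id]
                    if set(t.get("dependencies", [])) <= completed]
        if group_id == 0:
            for task in runnable:               # group 0: sequential
                execute(task)
                completed.add(task["name"])
        else:
            with ThreadPoolExecutor() as pool:  # later groups: concurrent
                list(pool.map(execute, runnable))
            completed.update(t["name"] for t in runnable)
    return completed
```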
Task-Monitor Integration
Orchestrate automatically pushes progress to the task-monitor TUI:
┌─────────────────────────────────────────────────────┐
│ orchestrate:01_TASKS.md:abc123 [=====> ] 3/5 │
│ Group 1: Create API endpoints, Create frontend │
│ Status: running | Success: 2 | Failed: 0 │
└─────────────────────────────────────────────────────┘
Configuration:
# Environment variables
TASK_MONITOR_API_URL=http://localhost:8765 # Default
TASK_MONITOR_ENABLED=true # Default, set to "false" to disable
Start the monitor TUI:
.pi/skills/task-monitor/run.sh tui
Scheduler Integration
Schedule recurring task file executions via the scheduler skill:
# Schedule nightly runs
orchestrate schedule 01_TASKS.md --cron "0 2 * * *"
# Schedule hourly
orchestrate schedule maintenance.md --cron "0 * * * *"
# Remove from schedule
orchestrate unschedule 01_TASKS.md
# View scheduled jobs
scheduler list
Automatic Registration:
- Completed orchestrations can auto-register with the scheduler (this feature is disabled by default)
- Use orchestrate schedule to explicitly schedule with cron
Integration with task-monitor:
The task-monitor TUI shows both:
- Running orchestrations (real-time progress)
- Upcoming scheduled jobs (from ~/.pi/scheduler/jobs.json)
Agent Selection
Specify agent per task in the task file:
| Agent | Use For |
|---|---|
| `general-purpose` | Code changes, bug fixes, implementation |
| `explore` | Research, code exploration, finding patterns |
Agent configs live at ~/.pi/agent/agents/<name>.md with:
- Frontmatter: name, description, tools, model
- Body: System prompt with instructions
Example Full Flow
User: "I need to fix the auth bug and add tests"
Agent: "I'll create a task file. First, some questions:
- Which auth system? OAuth, JWT, or session?
- Unit tests or integration tests?"
User: "OAuth, unit tests"
Agent: [Creates 01_TASKS.md]
# Task List: Fix OAuth Auth Bug
## Context
Fix the OAuth token refresh bug and add unit tests.
## Tasks
- [ ] **Task 1**: Investigate OAuth token refresh failure
- Agent: explore
- Dependencies: none
- [ ] **Task 2**: Fix the token refresh logic
- Agent: general-purpose
- Dependencies: Task 1
- [ ] **Task 3**: Add unit tests for token refresh
- Agent: general-purpose
- Dependencies: Task 2
## Questions/Blockers
None - resolved above.
User: "Run the tasks"
Agent: [Calls orchestrate({ taskFile: "01_TASKS.md" })]
→ Task 1: Memory recall finds prior OAuth issues, injects context
Explore agent investigates, reports findings
Quality gate: N/A for explore (no code changes)
✓ Complete
→ Task 2: Memory recall provides solutions from Task 1
General-purpose fixes the bug
Quality gate: pytest runs, all tests pass
✓ Complete
→ Task 3: Memory recall finds test patterns
General-purpose adds tests
Quality gate: pytest runs, new tests pass
✓ Complete
→ Archive: Session stored to episodic memory
"All 3 tasks complete. Session archived for future recall."
Key Principles
- Clarify FIRST - Questions/Blockers section forces collaborative discussion
- Sanity scripts BEFORE implementation - Non-standard APIs get verified working examples
- Completion tests BEFORE implementation - Definition of Done specifies exact test + assertion
- Human-agent collaboration - Agent proposes, human verifies/refines both sanity and tests
- Memory BEFORE - Pre-hook always runs, provides context from prior solutions
- Quality AFTER - Post-hook runs the Definition of Done test, must pass
- Retry logic - 3 attempts before hard failure requiring human intervention
- Isolated Context - Each task runs in `--no-session` mode
- Archive at End - Enables future recall of solutions
The fundamental rule: A task file without sanity scripts AND completion tests is INCOMPLETE. Do not proceed to implementation without both.
Non-Negotiable: Per-Task Testing
Testing per task is non-negotiable. Without verification, errors and hallucinations compound across tasks.
Core Rules
- NEVER skip tests - Exit code 3 (skip) is NOT acceptable for implementation tasks
- Every task must have a verifiable test - If no test exists, CREATE ONE first
- Tests must actually run - Infrastructure unavailable = FAIL, not SKIP
- Retry before failing - Multiple attempts (up to 3) before marking task as failed
Test Requirement by Task Type
| Task Type | Test Requirement | Skip Allowed? |
|---|---|---|
| Implementation (code changes) | Unit/integration test MUST pass | NO |
| Bug fix | Regression test proving fix | NO |
| Research/explore | No code = no test required | N/A |
| Documentation only | Linting/format check | YES (graceful) |
The Test-First Enforcement Loop
flowchart TB
Start[Start Task] --> HasTest{Test exists<br/>for this feature?}
HasTest -->|No| CreateTest[BLOCK: Create test first]
CreateTest --> Collab[Collaborate with human<br/>on test requirements]
Collab --> WriteTest[Write test - expect FAIL]
WriteTest --> HasTest
HasTest -->|Yes| Implement[Implement feature]
Implement --> RunTest[Run quality-gate]
RunTest --> Result{Tests pass?}
Result -->|Yes| Done[Task COMPLETE]
Result -->|No| Retry{Attempts < 3?}
Retry -->|Yes| Fix[Analyze failure, fix code]
Fix --> RunTest
Retry -->|No| FAIL[Task FAILED<br/>Human intervention needed]
Retry Logic (3 Attempts)
For each task:
Attempt 1: Execute task, run tests
→ If tests pass: DONE
→ If tests fail: Analyze error, fix, continue to Attempt 2
Attempt 2: Apply fix, run tests
→ If tests pass: DONE
→ If tests fail: Deeper analysis, continue to Attempt 3
Attempt 3: Final attempt with more aggressive fix
→ If tests pass: DONE
→ If tests fail: HARD FAIL - request human help
What Counts as "Tests Pass"
- Exit code 0 from quality-gate.sh
- Exit code 3 (skip) is REJECTED for implementation tasks
- The specific test for the feature must be in the test output
- "0 tests ran" is a FAIL (means test infrastructure broken)
Creating Tests When Missing
If a task's Definition of Done shows Test: MISSING:
- STOP - Do not implement without a test
- Collaborate with human to define what "done" means:
- "For Task X, what behavior proves it's working?"
- "What are the edge cases we should verify?"
- "What's the minimum viable assertion?"
- Update the task file with the agreed Definition of Done:
  ```markdown
  - Definition of Done:
    - Test: `tests/test_feature.py::test_new_behavior`
    - Assertion: When X happens, Y should result
  ```
- Write test FIRST - Test should FAIL initially (proves test works)
- Then implement - Make the test pass
- Verify - Run the specific test, confirm it passes
Example: Enforced Testing
## Task: Add VLM fallback for ImageProvider
Before implementing:
- [ ] Test exists: test_image_provider_vlm_fallback.py
- [ ] Test verifies: VLM failure triggers OCR fallback
- [ ] Test is RED (fails before implementation)
After implementing:
- [ ] Test is GREEN (passes)
- [ ] quality-gate.sh exits 0
- [ ] No skipped tests related to this feature
Infrastructure Requirements
Tests require infrastructure to run. If infrastructure is unavailable:
| Situation | Response |
|---|---|
| API server not running | FAIL (start the server, don't skip) |
| Database not connected | FAIL (fix connection, don't skip) |
| Browser not available | FAIL (use headless or fix setup) |
| Optional dependency missing | OK to skip THAT test, not all tests |
Principle: If tests can't run, the task can't be verified. Unverified = incomplete.
Testing Anti-Patterns: What NOT to Do
DO NOT create brittle mock-based tests. These are maintenance nightmares that break when implementation details change.
Banned Patterns
| Anti-Pattern | Why It's Bad | What to Do Instead |
|---|---|---|
| FakeDB classes | Query pattern changes break tests; doesn't verify real behavior | Use real database with test data |
| Mocked LLM responses | Verifies string matching, not actual LLM integration | Use real LLM calls (mark as @pytest.mark.llm) |
| Mocked embedding calls | Doesn't verify embedding service works | Call real embedding service |
| Complex monkeypatch chains | Fragile, hard to maintain, false confidence | Integration tests with real services |
Example: FakeDB Anti-Pattern (NEVER DO THIS)
# ❌ BAD - This test is WORTHLESS
class FakeDB:
def __init__(self, lessons, edges):
self._lessons = lessons
self._edges = edges
class AQLWrapper:
def execute(self, query, bind_vars=None):
# Fragile pattern matching that breaks when queries change
if "FILTER ed._to==@nid" in query: # What if query changes to IN?
# ... complex mock logic ...
pass
return FakeCursor([])
def test_cascade_with_fake_db(monkeypatch):
fake_db = FakeDB(lessons, edges)
monkeypatch.setattr(module, "get_db", lambda: fake_db) # ❌ Bypasses real DB
# This test passes but proves NOTHING about real behavior
Why this is bad:
- When the query changes from `ed._to==@nid` to `ed._to IN @targets`, the test breaks
- FakeDB logic diverges from real ArangoDB behavior
- Tests pass but production fails
- Maintenance burden exceeds value
Correct Pattern: Real Integration Tests
# ✅ GOOD - Uses real database, verifies real behavior
import pytest
from graph_memory.arango_client import get_db
from graph_memory.lessons.cascade import error_cascade
@pytest.fixture
def test_lessons(request):
"""Create test data in real ArangoDB, clean up after."""
db = get_db()
lessons = db.collection("lessons")
edges = db.collection("lesson_edges")
# Insert test data
test_ids = []
r1 = lessons.insert({"title": "R1: Test Requirement", "scope": "test"})
test_ids.append(r1["_id"])
# ... more test data ...
yield {"r1": r1["_id"], ...}
# Cleanup
for id in test_ids:
lessons.delete(id)
@pytest.mark.integration
def test_cascade_real_db(test_lessons):
"""Test cascade with REAL ArangoDB - no mocks."""
result = error_cascade(
requirement="R1: Test Requirement",
scope="test",
as_of=0,
depth=2
)
assert result["items"][0]["node_id"] == test_lessons["r1"]
# This actually verifies the real query works!
LLM and Embedding Tests
For tests involving LLM or embedding calls:
@pytest.mark.llm # Marks test as requiring LLM (can skip in CI)
def test_edge_verification_with_real_llm():
"""Tests actual LLM integration - not mocked."""
result = verify_edge_with_llm(edge_id="test_edge")
assert result["verified"] in [True, False] # Real LLM response
@pytest.mark.integration
def test_recall_with_real_embedding():
"""Tests actual embedding service - not mocked."""
result = recall("test query", k=3)
assert "used_dense" in result.get("meta", {}) # Real service responded
Test Categories
| Marker | Services Required | When to Run |
|---|---|---|
| (none) | None - pure unit tests | Always |
| `@pytest.mark.integration` | ArangoDB | Local dev, CI with DB |
| `@pytest.mark.llm` | LLM API | Manual, expensive |
| `@pytest.mark.embedding` | Embedding service | Local dev with services |
Skip Conditions
# conftest.py (markers can also be registered in pytest.ini)
import pytest
def pytest_configure(config):
config.addinivalue_line("markers", "integration: requires ArangoDB")
config.addinivalue_line("markers", "llm: requires LLM API calls")
config.addinivalue_line("markers", "embedding: requires embedding service")
@pytest.fixture(autouse=True)
def skip_without_db(request):
if request.node.get_closest_marker("integration"):
try:
from graph_memory.arango_client import get_db
get_db() # Will fail if DB not available
except Exception:
pytest.skip("ArangoDB not available")
The Golden Rule
If you're writing a Fake* class or complex monkeypatch to avoid calling a real service, STOP.
Either:
- Write a real integration test that calls the service
- Don't write a test at all (better than a false-confidence test)
- Mark the test to skip when service unavailable
A test that passes with mocks but fails in production is worse than no test - it gives false confidence.
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.