Install this skill from the multi-skill repository:

```bash
npx skills add grahama1970/agent-skills --skill "lean4-prove"
```
# SKILL.md
```yaml
---
name: lean4-prove
description: >
  Retrieval-augmented Lean4 proof generation. Queries 94k+ exemplars from DeepSeek-Prover V1+V2,
  uses hybrid search (BM25 + semantic + graph), generates via Claude, compiles in Docker, retries on failure.
allowed-tools: Bash, Read, Docker
triggers:
  - prove this
  - lean4 proof
  - generate proof
  - verify lean4
  - lean4-prove
  - formalize this requirement
metadata:
  short-description: Retrieval-augmented Lean4 proof generation
---
```
# lean4-prove

Retrieval-augmented Lean4 proof generation for engineering requirements. Uses 94,000+ proven theorems from DeepSeek-Prover V1+V2 to guide proof synthesis via hybrid search (BM25 + semantic + graph traversal).
## Architecture

```
Requirement + Tactics + Persona
            │
            ▼
┌───────────────────────────┐
│ 1. RECALL similar proofs  │ ← Hybrid search on ArangoDB
│    from 94k+ exemplars    │   (BM25 + semantic + graph)
└───────────────────────────┘
            │
            ▼
┌───────────────────────────┐
│ 2. BUILD support pack     │
│    - Validated imports    │
│    - Tactic patterns      │
│    - Similar proofs       │
└───────────────────────────┘
            │
            ▼
┌───────────────────────────┐
│ 3. GENERATE N candidates  │ ← Claude with exemplar context
│    constrained by corpus  │
└───────────────────────────┘
            │
            ▼
┌───────────────────────────┐
│ 4. COMPILE each in        │ ← lean_runner Docker
│    lean_runner container  │
└───────────────────────────┘
            │
       ┌────┴────┐
       │         │
    Success   Failure
       │         │
       ▼         ▼
    Return    Retry with error feedback
              (up to max_retries)
```
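The control flow above amounts to a nested candidate/retry loop. A minimal Python sketch; `generate_candidate` and `compile_in_docker` are hypothetical stand-ins for the skill's internals (the real logic lives in `run.sh`):

```python
# Sketch of the candidate/retry loop. The two helpers below are
# hypothetical stand-ins, not the skill's actual API.

def generate_candidate(requirement: str, feedback: str | None) -> str:
    """Step 3: Claude generation with exemplar context (stub)."""
    raise NotImplementedError

def compile_in_docker(code: str, container: str = "lean_runner") -> tuple[bool, str | None]:
    """Step 4: compile inside the lean_runner container (stub)."""
    raise NotImplementedError

def prove(requirement: str, candidates: int = 3, max_retries: int = 3) -> dict:
    errors, attempts = [], 0
    for i in range(candidates):
        code = generate_candidate(requirement, feedback=None)
        for attempt in range(1, max_retries + 1):
            attempts += 1
            ok, err = compile_in_docker(code)
            if ok:  # success: return immediately with provenance
                return {"success": True, "code": code,
                        "attempts": attempts, "candidate": i, "errors": None}
            errors.append(f"Candidate {i} attempt {attempt}: {err}")
            code = generate_candidate(requirement, feedback=err)  # retry with error feedback
    return {"success": False, "code": None, "attempts": attempts, "errors": errors}
```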
## Why Retrieval-Augmented?

- **Determinism** - Exact provenance: "used these 3 proofs as templates"
- **Version alignment** - Exemplars use imports that actually work
- **Fewer hallucinations** - Constrained to lemmas that exist
- **Tactic idioms** - Transfers working simp sets and proof patterns
## Usage

```bash
# Basic proof
./run.sh --requirement "Prove n + 0 = n"

# With tactics preference
./run.sh -r "Prove commutativity of addition" -t "simp,ring,omega"

# With persona context
./run.sh -r "Prove message integrity" -p "cryptographer"

# Via stdin (JSON)
echo '{"requirement": "Prove n + 0 = n", "tactics": ["rfl"]}' | ./run.sh

# Custom settings
./run.sh -r "Prove theorem" --candidates 5 --retries 5 --model claude-sonnet-4-20250514
```
## Output

```json
{
  "success": true,
  "code": "import Mathlib\n\ntheorem list_append_length (xs ys : List α) :\n (xs ++ ys).length = xs.length + ys.length := by\n induction xs with\n | nil => simp\n | cons x xs ih => simp [List.cons_append, ih]",
  "attempts": 1,
  "candidate": 0,
  "errors": null,
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "aesop", "norm_num", "intro"],
    "imports_count": 3
  }
}
```
On failure:

```json
{
  "success": false,
  "code": null,
  "attempts": 9,
  "errors": [
    "Candidate 0 attempt 1: unknown identifier 'natAdd'",
    "Candidate 1 attempt 1: type mismatch..."
  ],
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "exact"],
    "imports_count": 3
  }
}
```
## Parameters

| Parameter | Default | Description |
|---|---|---|
| `--requirement`, `-r` | (required) | Theorem to prove |
| `--tactics`, `-t` | none | Comma-separated preferred tactics |
| `--persona`, `-p` | none | Persona context for generation |
| `--candidates`, `-n` | 3 | Parallel proof candidates |
| `--retries` | 3 | Max retries per candidate |
| `--model` | claude-sonnet-4-20250514 | Claude model |
| `--container` | lean_runner | Docker container name |
| `--timeout` | 120 | Compilation timeout (seconds) |
## Environment Variables

```bash
# Proof generation
LEAN4_CONTAINER=lean_runner        # Docker container
LEAN4_TIMEOUT=120                  # Compile timeout
LEAN4_MAX_RETRIES=3                # Retries per candidate
LEAN4_CANDIDATES=3                 # Parallel candidates
LEAN4_PROVE_MODEL=opus             # Claude model (opus recommended for proofs)

# Retrieval (requires ArangoDB with ingested dataset)
LEAN4_RETRIEVAL=1                  # Enable/disable retrieval (default: 1)
LEAN4_RETRIEVAL_K=5                # Number of exemplars to retrieve
ARANGO_URL=http://127.0.0.1:8529   # ArangoDB connection
ARANGO_DB=memory                   # Database name (same as memory skill)
```
## Dataset Setup

The skill uses 94,000+ theorems from DeepSeek-Prover V1+V2 for retrieval:

- V1: 27,503 theorems (status: "proven")
- V2: 66,708 theorems (status: "ok" for 11,689 proven + others)

One-time ingest:

```bash
# Ingest full dataset (~5 min)
./ingest.sh

# Or limit for testing
./ingest.sh --limit 1000
```

This populates the `lean_theorems` collection in ArangoDB with:

- `formal_statement` - The theorem statement
- `formal_proof` - Working proof code
- `header` - Validated imports (Mathlib, Aesop, etc.)
- `tactics` - Extracted tactic names
- `source` - "deepseek-prover-v1" or "deepseek-prover-v2"
- `status` - "proven" (V1) or "ok"/"failed"/etc. (V2)
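Once ingested, the collection can be inspected directly. A sketch using python-arango (credentials are placeholders; the field names match the list above):

```python
from arango import ArangoClient

# Connect with the same settings as ARANGO_URL / ARANGO_DB above.
client = ArangoClient(hosts="http://127.0.0.1:8529")
db = client.db("memory", username="root", password="yourpass")  # placeholder credentials

# Pull a few proven theorems that use `simp`.
cursor = db.aql.execute(
    """
    FOR t IN lean_theorems
      FILTER t.status == "proven" AND "simp" IN t.tactics
      LIMIT 3
      RETURN { statement: t.formal_statement, source: t.source }
    """
)
for doc in cursor:
    print(doc["source"], "-", doc["statement"][:80])
```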
## Authentication

Uses the Claude Code CLI (`claude -p`) in headless, non-interactive mode. The CLI is called with:

- `-p` flag for print/headless mode
- `--output-format text` for plain text output
- `--max-turns 1` for single-turn operation

The environment variables `CLAUDE_CODE` and `CLAUDECODE` are cleared to avoid recursion detection when called from within Claude Code.

No separate API key is required; authentication is handled via your Claude subscription.
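Put together, the headless call looks roughly like this (a sketch; the real prompt also carries the retrieval support pack):

```python
import os
import subprocess

# Clear the recursion-detection variables mentioned above.
env = {k: v for k, v in os.environ.items()
       if k not in ("CLAUDE_CODE", "CLAUDECODE")}

proc = subprocess.run(
    ["claude", "-p", "--output-format", "text", "--max-turns", "1",
     "Prove `n + 0 = n` in Lean4. Return only the Lean code."],
    capture_output=True, text=True, env=env,
)
print(proc.stdout)
```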
## Requirements

- Docker with `lean_runner` container running (Lean4 + Mathlib installed)
- Claude Code CLI (`claude`) in PATH with valid authentication
- ArangoDB running locally (for retrieval; optional but recommended)
- Dataset ingested via `./ingest.sh` (one-time setup)
## Tactics

Common Lean4/Mathlib tactics to suggest:

| Tactic | Use For |
|---|---|
| `rfl` | Reflexivity proofs |
| `simp` | Simplification |
| `ring` | Ring arithmetic |
| `omega` | Linear arithmetic |
| `decide` | Decidable propositions |
| `exact` | Exact term construction |
| `apply` | Apply lemmas |
| `induction` | Inductive proofs |
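For a feel of what these tactics do, a few self-contained one-liners (Lean4 + Mathlib; each example compiles on its own):

```lean
import Mathlib

example (n : ℕ) : n + 0 = n := rfl                            -- reflexivity
example (n : ℕ) : n + 0 = n := by simp                        -- simplification
example (a b : ℤ) : (a + b)^2 = a^2 + 2*a*b + b^2 := by ring  -- ring arithmetic
example (n : ℕ) : n < n + 1 := by omega                       -- linear arithmetic
example : (2 : ℕ) + 2 = 4 := by decide                        -- decidable proposition
```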
## Examples

### Engineering: List operations

```bash
./run.sh -r "Prove length(xs ++ ys) = length(xs) + length(ys)" -t "simp,induction"
```

### Engineering: State machine property

```bash
./run.sh -r "When mux_enable is false, output shall equal default_value" \
  -p "embedded systems engineer" -t "simp,cases,decide"
```

### Engineering: Protocol correctness

```bash
./run.sh -r "Prove message append preserves checksum: checksum(msg ++ data) = update(checksum(msg), data)" \
  -t "simp,induction,ring"
```

### Cryptography

```bash
./run.sh -r "Prove that XOR is self-inverse: a ⊕ a = 0" -p "cryptographer" -t "simp,decide"
```

### Complex theorem

```bash
./run.sh -r "Prove the sum of first n natural numbers equals n*(n+1)/2" \
  -t "induction,simp,ring" \
  --candidates 5 \
  --retries 5
```
## Difference from lean4-verify

| Skill | Purpose |
|---|---|
| `lean4-verify` | Compile-only. Takes Lean4 code, returns pass/fail |
| `lean4-prove` | Full pipeline. Takes requirement, generates + compiles + retries |

Use `lean4-verify` when you already have Lean4 code to check.
Use `lean4-prove` when you need to generate the proof from a requirement.
## Advanced: Memory Integration

The skill integrates with the memory project for hybrid retrieval (BM25 + semantic + graph traversal):

```bash
# One-time setup: embed theorems and create edges
ARANGO_PASS=yourpass ARANGO_DB=memory python integrate_memory.py
```

This creates:

- Embeddings: 39k+ theorem embeddings in `lesson_embeddings` for semantic search
- Tactic edges: 10k+ edges between theorems sharing primary tactics
- Similarity edges: edges between semantically similar theorems (cosine > 0.7)

This enables:

- Semantic search: find theorems by meaning, not just keywords
- Multi-hop traversal: "What proofs use similar tactics?" (see the sketch below)
- Impact analysis: "If I change lemma X, what breaks?"
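A sketch of the multi-hop traversal over the tactic edges (the edge collection name `tactic_edges` is an assumption; substitute whatever name `integrate_memory.py` created in your deployment):

```python
from arango import ArangoClient

db = ArangoClient(hosts="http://127.0.0.1:8529").db(
    "memory", username="root", password="yourpass")  # placeholder credentials

# "What proofs use similar tactics?" as a 1..2-hop traversal
# from one proven theorem across shared-tactic edges.
cursor = db.aql.execute(
    """
    FOR t IN lean_theorems
      FILTER t.status == "proven"
      LIMIT 1
      FOR related IN 1..2 ANY t tactic_edges
        RETURN DISTINCT related.formal_statement
    """
)
for stmt in cursor:
    print(stmt[:80])
```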
## Lemma Dependency Graph (Planned)

Extract lemma dependencies during compilation to build an impact graph:

```
Query: "What's affected if mux_default changes?"
Answer:
mux_default
└── system_safe_state (NEEDS RE-VERIFY)
    └── requirement_42 (NEEDS RE-VERIFY)
```
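Once such a graph exists, the query above would reduce to an inbound traversal. A speculative sketch only (the `lemma_deps` edge collection does not exist yet; this illustrates the planned shape, reusing the `db` handle from the previous sketch):

```python
# Hypothetical impact query: walk inbound lemma_deps edges from the
# changed lemma to every theorem that (transitively) depends on it.
cursor = db.aql.execute(
    """
    FOR v, e, p IN 1..5 INBOUND "lean_theorems/mux_default" lemma_deps
      RETURN { theorem: v._key, depth: LENGTH(p.edges) }
    """
)
for hit in cursor:
    print(f'{hit["theorem"]} (NEEDS RE-VERIFY, depth {hit["depth"]})')
```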
## Engineering Use Cases
- Ambiguity detection: Formalization fails → requirement is vague
- Completeness check: "Are all state transitions covered?"
- Regression impact: "Change to X affects theorems Y, Z"
- Cross-reference: "Which requirements share this lemma?"