grahama1970 / lean4-prove
# Install this skill:
npx skills add grahama1970/agent-skills --skill "lean4-prove"

Install specific skill from multi-skill repository

# Description

> Retrieval-augmented Lean4 proof generation.

# SKILL.md


---
name: lean4-prove
description: >
  Retrieval-augmented Lean4 proof generation. Queries 94k+ exemplars from DeepSeek-Prover V1+V2,
  uses hybrid search (BM25 + semantic + graph), generates via Claude, compiles in Docker, retries on failure.
allowed-tools: Bash, Read, Docker
triggers:
  - prove this
  - lean4 proof
  - generate proof
  - verify lean4
  - lean4-prove
  - formalize this requirement
metadata:
  short-description: Retrieval-augmented Lean4 proof generation
---


lean4-prove

Retrieval-augmented Lean4 proof generation for engineering requirements. Uses 94,000+ proven theorems from DeepSeek-Prover V1+V2 to guide proof synthesis via hybrid search (BM25 + semantic + graph traversal).

Architecture

Requirement + Tactics + Persona
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. RECALL similar proofs  β”‚  ← Hybrid search on ArangoDB
β”‚    from 94k+ exemplars    β”‚     (BM25 + semantic + graph)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 2. BUILD support pack     β”‚
β”‚    - Validated imports    β”‚
β”‚    - Tactic patterns      β”‚
β”‚    - Similar proofs       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 3. GENERATE N candidates  β”‚  ← Claude with exemplar context
β”‚    constrained by corpus  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 4. COMPILE each in        β”‚  ← lean_runner Docker
β”‚    lean_runner container  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β”‚
   β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”
   β”‚         β”‚
 Success   Failure
   β”‚         β”‚
   β–Ό         β–Ό
 Return   Retry with error feedback
          (up to max_retries)
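The generate/compile/retry portion of the pipeline (steps 3–5 above) can be sketched as follows. `generate_candidates` and `compile_in_docker` are hypothetical stand-ins for the skill's internals, not its actual function names; the result shape mirrors the JSON output documented below.

```python
# Sketch of the generate -> compile -> retry loop.
# generate_candidates() and compile_in_docker() are hypothetical stand-ins
# for the skill's Claude and lean_runner calls.

def prove(requirement, generate_candidates, compile_in_docker,
          n_candidates=3, max_retries=3):
    errors = []
    candidates = generate_candidates(requirement, n=n_candidates)
    for i, code in enumerate(candidates):
        for attempt in range(1, max_retries + 1):
            ok, err = compile_in_docker(code)
            if ok:
                return {"success": True, "code": code,
                        "attempts": attempt, "candidate": i, "errors": None}
            errors.append(f"Candidate {i} attempt {attempt}: {err}")
            # Regenerate this candidate with the compiler error as feedback.
            code = generate_candidates(requirement, n=1, error=err)[0]
    return {"success": False, "code": None,
            "attempts": len(errors), "errors": errors}
```

With the defaults (3 candidates, 3 retries each), a total failure reports 9 attempts, matching the failure output shown below.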

Why Retrieval-Augmented?

  1. Determinism - Exact provenance: "used these 3 proofs as templates"
  2. Version alignment - Exemplars use imports that actually work
  3. Fewer hallucinations - Constrained to lemmas that exist
  4. Tactic idioms - Transfers working simp sets and proof patterns

Usage

# Basic proof
./run.sh --requirement "Prove n + 0 = n"

# With tactics preference
./run.sh -r "Prove commutativity of addition" -t "simp,ring,omega"

# With persona context
./run.sh -r "Prove message integrity" -p "cryptographer"

# Via stdin (JSON)
echo '{"requirement": "Prove n + 0 = n", "tactics": ["rfl"]}' | ./run.sh

# Custom settings
./run.sh -r "Prove theorem" --candidates 5 --retries 5 --model claude-sonnet-4-20250514

Output

{
  "success": true,
  "code": "import Mathlib\n\ntheorem list_append_length (xs ys : List Ξ±) :\n    (xs ++ ys).length = xs.length + ys.length := by\n  induction xs with\n  | nil => simp\n  | cons x xs ih => simp [List.cons_append, ih]",
  "attempts": 1,
  "candidate": 0,
  "errors": null,
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "aesop", "norm_num", "intro"],
    "imports_count": 3
  }
}

On failure:

{
  "success": false,
  "code": null,
  "attempts": 9,
  "errors": [
    "Candidate 0 attempt 1: unknown identifier 'natAdd'",
    "Candidate 1 attempt 1: type mismatch..."
  ],
  "retrieval": {
    "retrieved": 5,
    "tactics_added": ["simp", "exact"],
    "imports_count": 3
  }
}

Parameters

| Parameter | Default | Description |
|---|---|---|
| `--requirement`, `-r` | (required) | Theorem to prove |
| `--tactics`, `-t` | none | Comma-separated preferred tactics |
| `--persona`, `-p` | none | Persona context for generation |
| `--candidates`, `-n` | 3 | Parallel proof candidates |
| `--retries` | 3 | Max retries per candidate |
| `--model` | claude-sonnet-4-20250514 | Claude model |
| `--container` | lean_runner | Docker container name |
| `--timeout` | 120 | Compilation timeout (seconds) |

Environment Variables

# Proof generation
LEAN4_CONTAINER=lean_runner      # Docker container
LEAN4_TIMEOUT=120                # Compile timeout
LEAN4_MAX_RETRIES=3              # Retries per candidate
LEAN4_CANDIDATES=3               # Parallel candidates
LEAN4_PROVE_MODEL=opus           # Claude model (opus recommended for proofs)

# Retrieval (requires ArangoDB with ingested dataset)
LEAN4_RETRIEVAL=1                # Enable/disable retrieval (default: 1)
LEAN4_RETRIEVAL_K=5              # Number of exemplars to retrieve
ARANGO_URL=http://127.0.0.1:8529 # ArangoDB connection
ARANGO_DB=memory                 # Database name (same as memory skill)

Dataset Setup

The skill uses 94,000+ theorems from DeepSeek-Prover V1+V2 for retrieval:
- V1: 27,503 theorems (status: "proven")
- V2: 66,708 theorems (status: "ok" for 11,689 proven + others)

One-time ingest:

# Ingest full dataset (~5 min)
./ingest.sh

# Or limit for testing
./ingest.sh --limit 1000

This populates the lean_theorems collection in ArangoDB with:
- formal_statement - The theorem statement
- formal_proof - Working proof code
- header - Validated imports (Mathlib, Aesop, etc.)
- tactics - Extracted tactic names
- source - "deepseek-prover-v1" or "deepseek-prover-v2"
- status - "proven" (V1) or "ok"/"failed"/etc. (V2)
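As a sketch of how the keyword (BM25) leg of retrieval could query this collection: the exact AQL the skill runs is not shown in this document, and the ArangoSearch view name `lean_theorems_view` is an assumption, but the field and collection names match the ingested schema above.

```python
# Sketch of a BM25 keyword lookup against the lean_theorems collection.
# The ArangoSearch view name "lean_theorems_view" is an assumption; field
# names (formal_statement, formal_proof, header, tactics, status) match
# the ingested schema described above.

def build_theorem_query(keywords, k=5):
    """Return an AQL string plus bind variables for a BM25 search."""
    aql = """
    FOR doc IN lean_theorems_view
      SEARCH ANALYZER(doc.formal_statement IN TOKENS(@q, "text_en"), "text_en")
      FILTER doc.status IN ["proven", "ok"]
      SORT BM25(doc) DESC
      LIMIT @k
      RETURN {
        statement: doc.formal_statement,
        proof: doc.formal_proof,
        header: doc.header,
        tactics: doc.tactics
      }
    """
    return aql, {"q": " ".join(keywords), "k": k}
```

The pair would then be executed with python-arango via `db.aql.execute(aql, bind_vars=binds)`.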

Authentication

Uses Claude Code CLI (claude -p) in headless non-interactive mode.

The CLI is called with:
- -p flag for print/headless mode
- --output-format text for plain text output
- --max-turns 1 for single-turn operation

Environment variables CLAUDE_CODE and CLAUDECODE are cleared to avoid recursion detection when called from within Claude Code.

No separate API key required - authentication is handled via your Claude subscription.
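The headless invocation described above can be sketched as follows. The `-p`, `--output-format text`, and `--max-turns 1` flags and the cleared environment variables come from this document; the `--model` flag and prompt passing are assumptions about how the skill wires them together.

```python
import os
import subprocess

def build_claude_cmd(prompt, model="claude-sonnet-4-20250514"):
    """Command and environment for a headless Claude Code CLI call."""
    env = dict(os.environ)
    # Clear recursion-detection variables when running inside Claude Code.
    env.pop("CLAUDE_CODE", None)
    env.pop("CLAUDECODE", None)
    cmd = ["claude", "-p", "--output-format", "text",
           "--max-turns", "1", "--model", model, prompt]
    return cmd, env

def run_claude_headless(prompt, **kwargs):
    cmd, env = build_claude_cmd(prompt, **kwargs)
    return subprocess.run(cmd, env=env, capture_output=True,
                          text=True, timeout=120).stdout
```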

Requirements

  1. Docker with lean_runner container running (Lean4 + Mathlib installed)
  2. Claude Code CLI (claude) in PATH with valid authentication
  3. ArangoDB running locally (for retrieval - optional but recommended)
  4. Dataset ingested via ./ingest.sh (one-time setup)

Tactics

Common Lean4/Mathlib tactics to suggest:

| Tactic | Use for |
|---|---|
| `rfl` | Reflexivity proofs |
| `simp` | Simplification |
| `ring` | Ring arithmetic |
| `omega` | Linear arithmetic |
| `decide` | Decidable propositions |
| `exact` | Exact term construction |
| `apply` | Apply lemmas |
| `induction` | Inductive proofs |
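Small illustrative uses of these tactics (written for this document, not drawn from the skill's corpus):

```lean
import Mathlib

example : 2 + 2 = 4 := by rfl                       -- rfl: reflexivity
example (n : Nat) : n + 0 = n := by simp            -- simp: simplification
example (a b : Int) :
    (a + b) ^ 2 = a ^ 2 + 2 * a * b + b ^ 2 := by ring  -- ring arithmetic
example (n : Nat) : n < n + 1 := by omega           -- omega: linear arithmetic
example : 7 < 10 := by decide                       -- decide: decidable props
example (n : Nat) : n ≤ n := by exact Nat.le_refl n -- exact: term construction
example (n : Nat) : 0 + n = n := by                 -- induction: inductive proofs
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]
```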

Examples

Engineering: List operations

./run.sh -r "Prove length(xs ++ ys) = length(xs) + length(ys)" -t "simp,induction"

Engineering: State machine property

./run.sh -r "When mux_enable is false, output shall equal default_value" \
  -p "embedded systems engineer" -t "simp,cases,decide"

Engineering: Protocol correctness

./run.sh -r "Prove message append preserves checksum: checksum(msg ++ data) = update(checksum(msg), data)" \
  -t "simp,induction,ring"

Cryptography

./run.sh -r "Prove that XOR is self-inverse: a βŠ• a = 0" -p "cryptographer" -t "simp,decide"

Complex theorem

./run.sh -r "Prove the sum of first n natural numbers equals n*(n+1)/2" \
  -t "induction,simp,ring" \
  --candidates 5 \
  --retries 5

Difference from lean4-verify

| Skill | Purpose |
|---|---|
| lean4-verify | Compile-only. Takes Lean4 code, returns pass/fail |
| lean4-prove | Full pipeline. Takes a requirement, generates + compiles + retries |

Use lean4-verify when you already have Lean4 code to check.
Use lean4-prove when you need to generate the proof from a requirement.

Advanced: Memory Integration

The skill integrates with the memory project for hybrid retrieval (BM25 + semantic + graph traversal):

# One-time setup: embed theorems and create edges
ARANGO_PASS=yourpass ARANGO_DB=memory python integrate_memory.py

This creates:
- Embeddings: 39k+ theorem embeddings in lesson_embeddings for semantic search
- Tactic edges: 10k+ edges between theorems sharing primary tactics
- Similarity edges: Edges between semantically similar theorems (cosine > 0.7)

This enables:
- Semantic search: Find theorems by meaning, not just keywords
- Multi-hop traversal: "What proofs use similar tactics?"
- Impact analysis: "If I change lemma X, what breaks?"
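A multi-hop traversal over those edges could be built like this. The edge collection name `theorem_edges` is an assumption (the source mentions tactic and similarity edges without naming the collections), as is the bind-variable layout.

```python
# Sketch of a multi-hop traversal over tactic/similarity edges.
# The edge collection name "theorem_edges" is an assumption; the source
# describes the edges but does not name the collection.

def build_related_query(theorem_id, depth=2):
    """AQL to find theorems linked by shared tactics or similarity."""
    aql = """
    FOR v, e, p IN 1..@depth ANY @start theorem_edges
      RETURN DISTINCT {
        statement: v.formal_statement,
        tactics: v.tactics
      }
    """
    return aql, {"start": theorem_id, "depth": depth}
```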

Lemma Dependency Graph (Planned)

Extract lemma dependencies during compilation to build an impact graph:

Query: "What's affected if mux_default changes?"

Answer:
  mux_default
    └── system_safe_state (NEEDS RE-VERIFY)
          └── requirement_42 (NEEDS RE-VERIFY)

Engineering Use Cases

  1. Ambiguity detection: Formalization fails β†’ requirement is vague
  2. Completeness check: "Are all state transitions covered?"
  3. Regression impact: "Change to X affects theorems Y, Z"
  4. Cross-reference: "Which requirements share this lemma?"

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.