Install this specific skill from the multi-skill repository:
npx skills add grahama1970/agent-skills --skill "extractor"
# Description
Extract content from any document using the Preset-First Agentic Pipeline. Auto-detects format and document type, and supports PDF, DOCX, HTML, XML, PPTX, XLSX, EPUB, Markdown, and images.
# SKILL.md
name: extractor
description: >
  Extract content from any document using the Preset-First Agentic Pipeline.
  Auto-detects format and document type (scientific papers, requirements specs, etc.).
  Supports PDF, DOCX, HTML, XML, PPTX, XLSX, EPUB, Markdown, images.
  Use when user says "extract this", "convert to markdown", "process pdf", or provides a document.
allowed-tools: Bash, Read
triggers:
- extract this
- extract document
- extract pdf
- extract text
- convert to markdown
- convert to text
- parse this file
- process document
- process pdf
- get sections from
- extract sections
- run extractor
- pdf to markdown
- docx to markdown
- document to json
metadata:
  short-description: Preset-First document extraction (PDF/DOCX/HTML/XML)
  project-path: /home/graham/workspace/experiments/extractor
Extractor
Self-correcting agentic document extraction using a Preset-First Methodology.
Auto-detects document type and applies calibrated extraction settings.
Quick Start
# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf
# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results
# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown
# OCR scanned PDFs (lazy-loads OCRmyPDF docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr
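--auto-ocr pulls the OCRmyPDF docker image, so it needs a local Docker daemon. A minimal sketch of guarding for that (the docker check itself is not part of the skill):
# Sketch: fall back to --no-auto-ocr when Docker is unavailable
if command -v docker >/dev/null 2>&1; then
  .pi/skills/extractor/run.sh scanned.pdf --auto-ocr --ocr-lang eng
else
  .pi/skills/extractor/run.sh scanned.pdf --no-auto-ocr
fi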
Extraction Modes
| Mode | Flag | Description |
|---|---|---|
| Auto | (default) | Profile detector picks best settings |
| Fast | --fast | PyMuPDF only, no ML/LLM (fastest) |
| Accurate | --accurate | Full pipeline with LLM enhancements |
| Offline | --offline | Deterministic, no network calls |
# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast
# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate
# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline
Collaboration Flow
For PDFs without --preset, the skill runs an intelligent collaboration flow:
- Profile Detection: Analyzes document (layout, tables, formulas, requirements)
- High Confidence Match: If confidence >= 8, auto-extracts with detected preset
- Low Confidence / Unknown:
  - Interactive (TTY): Prompts user to select preset
  - Non-interactive: Uses auto mode with warning
# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only
# Output:
# {
# "preset": "arxiv",
# "confidence": 12,
# "tables": true,
# "figures": true,
# "formulas": true,
# "recommended_mode": "accurate"
# }
# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
# [1] arxiv - Academic papers [RECOMMENDED]
# [2] requirements_spec - Engineering specs
# [3] auto - Let pipeline decide
# [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:
# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive
Preset Selection
The pipeline auto-detects document type via s00_profile_detector:
| Preset | Detected When | Confidence Points |
|---|---|---|
| arxiv | Academic papers (2-column, math, "Abstract/References") | +5 filename, +4 sections, +3 layout |
| requirements_spec | Engineering specs (REQ-xxx, "Shall", nested sections) | +5 filename, +4 REQ pattern |
| auto | Unknown documents | Fallback when confidence < 8 |
# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec
# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf
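As a sketch of how an agent might combine detection with a forced preset (assumes the --profile-only JSON is printed to stdout and jq is available):
# Sketch: force the detected preset only when confidence clears the threshold
profile=$(.pi/skills/extractor/run.sh paper.pdf --profile-only)
preset=$(echo "$profile" | jq -r '.preset')
confidence=$(echo "$profile" | jq -r '.confidence')
if [ "$confidence" -ge 8 ]; then
  .pi/skills/extractor/run.sh paper.pdf --preset "$preset"
else
  .pi/skills/extractor/run.sh paper.pdf --no-interactive
fi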
Output Options
# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json
# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown
# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only
Supported Formats
Cross-format parity measured against HTML reference (2026-01-17):
| Format | Method | Parity | Notes |
|---|---|---|---|
| Markdown | Direct parse | 100% | Perfect structural match |
| DOCX | Native XML (python-docx) | 100% | Perfect structural match |
| HTML | BeautifulSoup | Reference | Baseline for comparison |
| XML | defusedxml | 90% | Structure preserved, markdown differs |
| PDF | 14-stage pipeline | 87% | Varies by document complexity |
| RST | docutils | 85% | Section structure varies |
| EPUB | ebooklib | 82% | Chapter structure varies |
| PPTX | python-pptx | 81% | Slide-based structure |
| XLSX | openpyxl | 16% | Expected (spreadsheet format) |
| Images | OCR/VLM | 16% | Requires VLM for text extraction |
Pipeline Stages
The full pipeline runs 14+ stages:
00_profile_detector Detect document type, select preset
01_annotation_processor Strip PDF annotations
02_marker_extractor Extract blocks (text, tables, figures)
03_suspicious_headers Verify header classifications with VLM
04_section_builder Build document sections
05_table_extractor Extract and describe tables
06_figure_extractor Extract and describe figures
07_duckdb_ingest Assemble into queryable DB
08_extract_requirements Mine requirements (if detected)
08b_lean4_theorem_prover Formal proofs (scientific only)
09_section_summarizer Generate section summaries
10_markdown_exporter Export to Markdown
14_report_generator Generate extraction report
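Each stage writes to its own numbered directory under the output root (see Output Structure below). A quick way to see which stages produced output, assuming that layout:
# Sketch: list per-stage output directories after a run
.pi/skills/extractor/run.sh paper.pdf --out ./results
find ./results -maxdepth 1 -type d | sort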
Output Structure
{
"success": true,
"preset": "arxiv",
"outputs": {
"markdown": "results/10_markdown_exporter/document.md",
"sections": "results/04_section_builder/json_output/04_sections.json",
"tables": "results/05_table_extractor/json_output/05_tables.json",
"figures": "results/06_figure_extractor/json_output/06_figures.json",
"report": "results/14_report_generator/json_output/final_report.json"
},
"counts": {
"sections": 12,
"tables": 5,
"figures": 8
}
}
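If the summary JSON above is what run.sh prints to stdout (an assumption here), an agent can pull the paths it needs with jq:
# Sketch: read the markdown path and section count from the summary JSON
result=$(.pi/skills/extractor/run.sh paper.pdf --out ./results)
md_path=$(echo "$result" | jq -r '.outputs.markdown')
sections=$(echo "$result" | jq -r '.counts.sections')
echo "Markdown at $md_path ($sections sections)"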
Batch Processing
# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results
# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"
# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive
# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results
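Where one bad file should not stop the rest, a per-file loop with --continue-on-error is an alternative to directory mode (a sketch; the loop itself is not part of the skill):
# Sketch: per-file batch with failure logging
mkdir -p ./results
for f in ./documents/*.pdf; do
  .pi/skills/extractor/run.sh "$f" --no-interactive --continue-on-error \
    --out "./results/$(basename "$f" .pdf)" || echo "failed: $f" >> ./results/failures.log
done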
Agent-Friendly Flags
| Flag | Purpose |
|---|---|
| --profile-only | Return profile JSON without extraction |
| --no-interactive | Skip prompts, use auto mode |
| --preset <name> | Force preset (skip detection) |
| --fast | No LLM, quick extraction |
| --toc-check | Check TOC integrity against extracted sections |
| --auto-ocr | OCR scanned PDFs with OCRmyPDF (lazy-loads docker image) |
| --no-auto-ocr | Disable OCRmyPDF preprocessing for scanned PDFs |
| --skip-scanned | Skip scanned PDFs and write a skip manifest |
| --ocr-lang <langs> | OCR language(s), e.g. eng or eng+deu |
| --ocr-deskew | Deskew scanned pages during OCR |
| --ocr-force | Force OCR even if text exists |
| --ocr-timeout <sec> | OCR timeout in seconds |
| --continue-on-error | Continue pipeline on step failures (batch-friendly) |
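A typical unattended invocation combines several of these (all flags are from the table above; using them together in a single call is an assumption):
# Unattended run: no prompts, skip scanned PDFs, keep going past step failures
.pi/skills/extractor/run.sh ./documents/ --no-interactive --skip-scanned --continue-on-error --out ./results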
TOC Integrity Check
Verify that extracted sections match the PDF's Table of Contents (bookmarks):
# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check
# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check
Output:
{
"success": true,
"has_toc": true,
"integrity_score": 0.85,
"status": "GOOD",
"toc_entries_count": 20,
"sections_count": 18,
"matched_count": 17,
"missing_count": 3,
"matched": [
{ "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
],
"missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}
Status levels:
- EXCELLENT: >= 90% match
- GOOD: >= 70% match
- FAIR: >= 50% match
- POOR: < 50% match
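In CI, the integrity_score can gate a run (a sketch, assuming the --toc-check JSON is printed to stdout and jq is available):
# Sketch: fail the job when TOC integrity falls below the GOOD threshold (0.70)
toc=$(.pi/skills/extractor/run.sh ./results/ --toc-check)
echo "$toc" | jq -e '.integrity_score >= 0.70' >/dev/null || { echo "TOC integrity below 0.70"; exit 1; }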
Environment
Requires the extractor project with its virtual environment:
- Project: /home/graham/workspace/experiments/extractor
- Venv: .venv/bin/python
- Dependencies: scillm, fetcher (local paths)
Set EXTRACTOR_ROOT to override the project location.
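For a checkout in a non-default location, point EXTRACTOR_ROOT at it before running (the path below is an example):
# Override the project location (example path)
export EXTRACTOR_ROOT=/path/to/extractor
.pi/skills/extractor/run.sh paper.pdf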
Sanity Check
# Verify skill works across all formats
.pi/skills/extractor/sanity.sh
Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG
LLM Requirements
For accurate mode (VLM/table descriptions):
- CHUTES_API_BASE - Chutes API endpoint
- CHUTES_API_KEY - API key
- CHUTES_VLM_MODEL - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
- CHUTES_TEXT_MODEL - Text model (default: moonshotai/Kimi-K2-Instruct-0905)
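A minimal environment for an accurate-mode run (variable names are from the list above; the values are placeholders):
# Placeholder credentials for --accurate mode
export CHUTES_API_BASE="https://chutes.example/api"   # placeholder endpoint
export CHUTES_API_KEY="<your-key>"                    # placeholder key
export CHUTES_VLM_MODEL="Qwen/Qwen3-VL-235B-A22B-Instruct"
export CHUTES_TEXT_MODEL="moonshotai/Kimi-K2-Instruct-0905"
.pi/skills/extractor/run.sh paper.pdf --accurate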
For Lean4 proving (arxiv preset):
- lean_runner container running
- OPENROUTER_API_KEY set
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.