extractor

by @grahama1970 in Tools

# Install this skill:

npx skills add grahama1970/agent-skills --skill "extractor"

Install specific skill from multi-skill repository

# Description

# SKILL.md

name: extractor
description: >
  Extract content from any document using the Preset-First Agentic Pipeline.
  Auto-detects format and document type (scientific papers, requirements specs, etc.).
  Supports PDF, DOCX, HTML, XML, PPTX, XLSX, EPUB, Markdown, images.
  Use when user says "extract this", "convert to markdown", "process pdf", or provides a document.
allowed-tools: Bash, Read
triggers:
- extract this
- extract document
- extract pdf
- extract text
- convert to markdown
- convert to text
- parse this file
- process document
- process pdf
- get sections from
- extract sections
- run extractor
- pdf to markdown
- docx to markdown
- document to json
metadata:
  short-description: Preset-First document extraction (PDF/DOCX/HTML/XML)
  project-path: /home/graham/workspace/experiments/extractor

Extractor

Self-correcting agentic document extraction using a Preset-First Methodology.
Auto-detects document type and applies calibrated extraction settings.

Quick Start

# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf

# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results

# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown

# OCR scanned PDFs (lazy-loads the OCRmyPDF Docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr

Extraction Modes

| Mode | Flag | Description |
|------|------|-------------|
| Auto | (default) | Profile detector picks best settings |
| Fast | `--fast` | PyMuPDF only, no ML/LLM (fastest) |
| Accurate | `--accurate` | Full pipeline with LLM enhancements |
| Offline | `--offline` | Deterministic, no network calls |

# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast

# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate

# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline

Collaboration Flow

When a PDF is passed without --preset, the skill runs a collaboration flow:

1. Profile Detection: Analyzes document (layout, tables, formulas, requirements)
2. High Confidence Match: If confidence >= 8, auto-extracts with detected preset
3. Low Confidence / Unknown:
   - Interactive (TTY): Prompts user to select preset
   - Non-interactive: Uses auto mode with warning
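
The flow above can be sketched as a small decision function. This is a minimal illustration, not the skill's actual code: `next_step`, its arguments, and the returned tuples are hypothetical names. Only the threshold of 8 and the interactive/non-interactive split come from the description above.

```python
import sys

CONFIDENCE_THRESHOLD = 8  # from the flow description: confidence >= 8 auto-extracts


def next_step(profile, no_interactive=False, is_tty=None):
    """Return a hypothetical (action, preset) pair for a detected profile.

    profile is assumed to look like the --profile-only JSON,
    e.g. {"preset": "arxiv", "confidence": 12}.
    """
    if is_tty is None:
        is_tty = sys.stdin.isatty()  # a piped or CI stdin is not a TTY

    # High confidence: extract with the detected preset.
    if profile.get("confidence", 0) >= CONFIDENCE_THRESHOLD:
        return ("extract", profile["preset"])

    # Low confidence in a terminal: ask the user to pick a preset.
    if is_tty and not no_interactive:
        return ("prompt", None)

    # Low confidence without a terminal: fall back to auto mode.
    return ("extract", "auto")


print(next_step({"preset": "arxiv", "confidence": 12}, is_tty=False))
# ('extract', 'arxiv')
```

Passing `is_tty` explicitly keeps the sketch testable; a real script would normally rely on the `sys.stdin.isatty()` default.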

# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only

# Output:
# {
#   "preset": "arxiv",
#   "confidence": 12,
#   "tables": true,
#   "figures": true,
#   "formulas": true,
#   "recommended_mode": "accurate"
# }

# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
#   [1] arxiv - Academic papers [RECOMMENDED]
#   [2] requirements_spec - Engineering specs
#   [3] auto - Let pipeline decide
#   [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:

# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive

Preset Selection

The pipeline auto-detects document type via s00_profile_detector:

| Preset | Detected When | Confidence Points |
|--------|---------------|-------------------|
| arxiv | Academic papers (2-column, math, "Abstract/References") | +5 filename, +4 sections, +3 layout |
| requirements_spec | Engineering specs (REQ-xxx, "Shall", nested sections) | +5 filename, +4 REQ pattern |
| auto | Unknown documents | Fallback when confidence < 8 |
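
As a rough illustration of how the points in the table could add up, the sketch below sums the points for whichever signals fired and falls back to `auto` below the threshold. The point values and the threshold of 8 are taken from the table; the signal names (`filename`, `sections`, `layout`, `req_pattern`) and both functions are hypothetical, and the real detector may combine evidence differently.

```python
# Point values copied from the table; the signal names are hypothetical labels.
PRESET_POINTS = {
    "arxiv": {"filename": 5, "sections": 4, "layout": 3},
    "requirements_spec": {"filename": 5, "req_pattern": 4},
}
CONFIDENCE_THRESHOLD = 8  # below this, the fallback preset is "auto"


def score_presets(signals):
    """Sum the points of every signal that fired, per preset.

    signals maps a preset name to the set of signal names observed for it.
    """
    return {
        preset: sum(pts for name, pts in points.items()
                    if name in signals.get(preset, set()))
        for preset, points in PRESET_POINTS.items()
    }


def pick_preset(signals):
    """Return (preset, confidence), using "auto" when confidence is too low."""
    scores = score_presets(signals)
    best = max(scores, key=scores.get)
    if scores[best] >= CONFIDENCE_THRESHOLD:
        return best, scores[best]
    return "auto", scores[best]


print(pick_preset({"arxiv": {"filename", "sections", "layout"}}))
# ('arxiv', 12)
print(pick_preset({"requirements_spec": {"req_pattern"}}))
# ('auto', 4)
```

Under these assumptions, a paper that triggers all three arxiv signals scores 5 + 4 + 3 = 12.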

# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec

# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf

Output Options

# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json

# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown

# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only

Supported Formats

Cross-format parity measured against HTML reference (2026-01-17):

| Format | Method | Parity | Notes |
|--------|--------|--------|-------|
| Markdown | Direct parse | 100% | Perfect structural match |
| DOCX | Native XML (python-docx) | 100% | Perfect structural match |
| HTML | BeautifulSoup | Reference | Baseline for comparison |
| XML | defusedxml | 90% | Structure preserved, markdown differs |
| PDF | 14-stage pipeline | 87% | Varies by document complexity |
| RST | docutils | 85% | Section structure varies |
| EPUB | ebooklib | 82% | Chapter structure varies |
| PPTX | python-pptx | 81% | Slide-based structure |
| XLSX | openpyxl | 16% | Expected (spreadsheet format) |
| Images | OCR/VLM | 16% | Requires VLM for text extraction |

Pipeline Stages

The full pipeline runs 14+ stages. The listing below covers stages 00-10 and 14; stages 11-13 are not shown:

00_profile_detector     Detect document type, select preset
01_annotation_processor Strip PDF annotations
02_marker_extractor     Extract blocks (text, tables, figures)
03_suspicious_headers   Verify header classifications with VLM
04_section_builder      Build document sections
05_table_extractor      Extract and describe tables
06_figure_extractor     Extract and describe figures
07_duckdb_ingest        Assemble into queryable DB
08_extract_requirements Mine requirements (if detected)
08b_lean4_theorem_prover Formal proofs (scientific only)
09_section_summarizer   Generate section summaries
10_markdown_exporter    Export to Markdown
14_report_generator     Generate extraction report

Output Structure

{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": {
    "sections": 12,
    "tables": 5,
    "figures": 8
  }
}
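
A caller might consume this result along the following lines. This is a sketch under assumptions: the `summarize` helper is hypothetical, and it assumes the JSON shown above is available as a string (for example, captured from the command's standard output).

```python
import json

# Sample copied from the output structure shown above.
SAMPLE = """
{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": {"sections": 12, "tables": 5, "figures": 8}
}
"""


def summarize(result_json):
    """Map each output name to (path, count); count is None when not reported."""
    result = json.loads(result_json)
    if not result.get("success"):
        raise RuntimeError("extraction did not report success")
    counts = result.get("counts", {})
    return {name: (path, counts.get(name))
            for name, path in result.get("outputs", {}).items()}


for name, (path, count) in summarize(SAMPLE).items():
    print(name, path, count)
```

Checking `success` before touching any path avoids reading stale files from an earlier run in the same output directory.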

Batch Processing

# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results

# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"

# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive

# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results
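
To preview which files a recursive pattern such as `**/*.pdf` would select, a small script can expand it before starting a batch. This sketch uses Python's standard `pathlib` globbing; whether the skill's `--glob` option uses exactly the same matching rules is an assumption, and `find_documents` is a hypothetical helper.

```python
from pathlib import Path


def find_documents(root, pattern="**/*.pdf"):
    """Return the files under root that match pattern, in sorted order."""
    # In pathlib, "**" matches the root directory itself and any subdirectory.
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


if __name__ == "__main__":
    for path in find_documents("./documents"):
        print(path)
```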

Agent-Friendly Flags

| Flag | Purpose |
|------|---------|
| `--profile-only` | Return profile JSON without extraction |
| `--no-interactive` | Skip prompts, use auto mode |
| `--preset <name>` | Force preset (skip detection) |
| `--fast` | No LLM, quick extraction |
| `--toc-check` | Check TOC integrity against extracted sections |
| `--auto-ocr` | OCR scanned PDFs with OCRmyPDF (lazy-loads the Docker image) |
| `--no-auto-ocr` | Disable OCRmyPDF preprocessing for scanned PDFs |
| `--skip-scanned` | Skip scanned PDFs and write a skip manifest |
| `--ocr-lang <langs>` | OCR language(s), e.g. `eng` or `eng+deu` |
| `--ocr-deskew` | Deskew scanned pages during OCR |
| `--ocr-force` | Force OCR even if text exists |
| `--ocr-timeout <sec>` | OCR timeout in seconds |
| `--continue-on-error` | Continue pipeline on step failures (batch-friendly) |

TOC Integrity Check

Verify that extracted sections match the PDF's Table of Contents (bookmarks):

# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check

# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check

Output:

{
  "success": true,
  "has_toc": true,
  "integrity_score": 0.85,
  "status": "GOOD",
  "toc_entries_count": 20,
  "sections_count": 18,
  "matched_count": 17,
  "missing_count": 3,
  "matched": [
    { "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
  ],
  "missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}

Status levels:

EXCELLENT: >= 90% match
GOOD: >= 70% match
FAIR: >= 50% match
POOR: < 50% match
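
The status thresholds can be expressed as a short mapping. In the sample output above, 17 matched entries out of 20 TOC entries gives 0.85, so this sketch assumes the score is matched entries divided by TOC entries; that formula is inferred from the sample rather than confirmed, and both function names are hypothetical.

```python
def integrity_score(matched_count, toc_entries_count):
    """Assumed formula: fraction of TOC entries matched to a section."""
    if toc_entries_count <= 0:
        raise ValueError("no TOC entries to compare against")
    return matched_count / toc_entries_count


def integrity_status(score):
    """Map a score in [0, 1] to the status levels listed above."""
    if score >= 0.90:
        return "EXCELLENT"
    if score >= 0.70:
        return "GOOD"
    if score >= 0.50:
        return "FAIR"
    return "POOR"


score = integrity_score(17, 20)
print(score, integrity_status(score))
# 0.85 GOOD
```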

Environment

Requires the extractor project with its virtual environment:

Project: /home/graham/workspace/experiments/extractor
Venv: .venv/bin/python
Dependencies: scillm, fetcher (local paths)

Set EXTRACTOR_ROOT to override the project location.
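
A wrapper that needs the project's interpreter could resolve it roughly as follows. This is a sketch, assuming `EXTRACTOR_ROOT` simply replaces the default project path and that the virtual environment lives at `.venv/bin/python` beneath it; `venv_python` is a hypothetical helper, not part of the skill.

```python
import os
from pathlib import PurePosixPath

# Default project location, as listed above.
DEFAULT_ROOT = "/home/graham/workspace/experiments/extractor"


def venv_python(env=None):
    """Return the assumed path to the project's virtualenv interpreter."""
    env = os.environ if env is None else env
    # An unset or empty EXTRACTOR_ROOT falls back to the default location.
    root = env.get("EXTRACTOR_ROOT") or DEFAULT_ROOT
    return PurePosixPath(root) / ".venv" / "bin" / "python"


print(venv_python({"EXTRACTOR_ROOT": "/opt/extractor"}))
# /opt/extractor/.venv/bin/python
```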

Sanity Check

# Verify skill works across all formats
.pi/skills/extractor/sanity.sh

Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG

LLM Requirements

For accurate mode (VLM/table descriptions):

CHUTES_API_BASE - Chutes API endpoint
CHUTES_API_KEY - API key
CHUTES_VLM_MODEL - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
CHUTES_TEXT_MODEL - Text model (default: moonshotai/Kimi-K2-Instruct-0905)

For Lean4 proving (arxiv preset):

lean_runner container running
OPENROUTER_API_KEY set
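
Before starting a long run, a quick pre-flight check can report which of these variables are missing. The sketch below treats the two model variables as optional because they have documented defaults; which variables the pipeline strictly requires is an assumption, and `missing_vars` and the feature labels are hypothetical names.

```python
import os

# Variables without a documented default, grouped by hypothetical feature label.
REQUIRED = {
    "accurate": ["CHUTES_API_BASE", "CHUTES_API_KEY"],
    "lean4": ["OPENROUTER_API_KEY"],
}


def missing_vars(feature, env=None):
    """Return the names in REQUIRED[feature] that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED[feature] if not env.get(name)]


if __name__ == "__main__":
    for feature in REQUIRED:
        missing = missing_vars(feature)
        print(feature, "ok" if not missing else "missing: " + ", ".join(missing))
```

This only checks the environment; it does not verify that the lean_runner container is running.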

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf
