# Install this skill:
npx skills add grahama1970/agent-skills --skill "extractor"

Install specific skill from multi-skill repository

# SKILL.md


---
name: extractor
description: >
  Extract content from any document using the Preset-First Agentic Pipeline.
  Auto-detects format and document type (scientific papers, requirements specs, etc.).
  Supports PDF, DOCX, HTML, XML, PPTX, XLSX, EPUB, Markdown, images.
  Use when user says "extract this", "convert to markdown", "process pdf", or provides a document.
allowed-tools: Bash, Read
triggers:
  - extract this
  - extract document
  - extract pdf
  - extract text
  - convert to markdown
  - convert to text
  - parse this file
  - process document
  - process pdf
  - get sections from
  - extract sections
  - run extractor
  - pdf to markdown
  - docx to markdown
  - document to json
metadata:
  short-description: Preset-First document extraction (PDF/DOCX/HTML/XML)
  project-path: /home/graham/workspace/experiments/extractor
---


Extractor

Self-correcting agentic document extraction using a Preset-First Methodology.
Auto-detects document type and applies calibrated extraction settings.

Quick Start

# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf

# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results

# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown

# OCR scanned PDFs (lazy-loads OCRmyPDF docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr

Extraction Modes

| Mode | Flag | Description |
|------|------|-------------|
| Auto | (default) | Profile detector picks best settings |
| Fast | --fast | PyMuPDF only, no ML/LLM (fastest) |
| Accurate | --accurate | Full pipeline with LLM enhancements |
| Offline | --offline | Deterministic, no network calls |

# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast

# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate

# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline

Collaboration Flow

For PDFs without --preset, the skill runs an intelligent collaboration flow:

  1. Profile Detection: Analyzes document (layout, tables, formulas, requirements)
  2. High Confidence Match: If confidence >= 8, auto-extracts with detected preset
  3. Low Confidence / Unknown:
     • Interactive (TTY): Prompts user to select preset
     • Non-interactive: Uses auto mode with warning

# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only

# Output:
# {
#   "preset": "arxiv",
#   "confidence": 12,
#   "tables": true,
#   "figures": true,
#   "formulas": true,
#   "recommended_mode": "accurate"
# }

# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
#   [1] arxiv - Academic papers [RECOMMENDED]
#   [2] requirements_spec - Engineering specs
#   [3] auto - Let pipeline decide
#   [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:

# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive
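
The decision flow described in this section can be sketched in Python. This is a minimal illustration, not the skill's actual code: the names `choose_preset`, `ask_user` and `CONFIDENCE_THRESHOLD` are hypothetical, and only the threshold of 8 and the fallback to auto mode come from the description above.

```python
import sys
from typing import Callable, Optional

CONFIDENCE_THRESHOLD = 8  # from the flow above: confidence >= 8 auto-extracts


def choose_preset(
    profile: dict,
    forced_preset: Optional[str] = None,
    interactive: Optional[bool] = None,
    ask_user: Optional[Callable[[dict], str]] = None,
) -> str:
    """Return the preset to use for one document (hypothetical helper)."""
    # A forced preset (the --preset flag) skips detection entirely.
    if forced_preset:
        return forced_preset
    # High-confidence match: use the detected preset without asking.
    if profile.get("confidence", 0) >= CONFIDENCE_THRESHOLD:
        return profile.get("preset", "auto")
    # Low confidence: prompt only when attached to a terminal.
    if interactive is None:
        interactive = sys.stdin.isatty()
    if interactive and ask_user is not None:
        return ask_user(profile)
    # Non-interactive: fall back to auto mode with a warning.
    print("warning: low confidence, using auto mode", file=sys.stderr)
    return "auto"
```

Passing `interactive=False` mirrors what `--no-interactive` is described as doing: the prompt is skipped and the low-confidence case resolves to `auto`.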

Preset Selection

The pipeline auto-detects document type via s00_profile_detector:

| Preset | Detected When | Confidence Points |
|--------|---------------|-------------------|
| arxiv | Academic papers (2-column, math, "Abstract/References") | +5 filename, +4 sections, +3 layout |
| requirements_spec | Engineering specs (REQ-xxx, "Shall", nested sections) | +5 filename, +4 REQ pattern |
| auto | Unknown documents | Fallback when confidence < 8 |

# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec

# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf
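
One way to read the confidence points is as an additive score per preset, compared against the threshold of 8. The sketch below assumes exactly that. The point values are copied from the table, but the signal names (`filename`, `sections`, `layout`, `req_pattern`) and the functions are hypothetical, and the real detector may weigh signals differently.

```python
# Point values copied from the preset table; signal names are illustrative.
PRESET_POINTS = {
    "arxiv": {"filename": 5, "sections": 4, "layout": 3},
    "requirements_spec": {"filename": 5, "req_pattern": 4},
}
THRESHOLD = 8  # below this, the table says the pipeline falls back to auto


def score_presets(signals: dict) -> dict:
    """Sum the points of every signal that fired, per preset.

    `signals` maps a preset name to the set of signal names detected for it.
    """
    return {
        preset: sum(pts for name, pts in points.items()
                    if name in signals.get(preset, set()))
        for preset, points in PRESET_POINTS.items()
    }


def pick_preset(signals: dict) -> tuple:
    """Return (preset, confidence), falling back to auto below the threshold."""
    scores = score_presets(signals)
    best = max(scores, key=scores.get)
    if scores[best] >= THRESHOLD:
        return best, scores[best]
    return "auto", scores[best]
```

Under this reading, an academic paper that triggers all three arxiv signals scores 5 + 4 + 3 = 12, which clears the threshold.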

Output Options

# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json

# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown

# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only

Supported Formats

Cross-format parity measured against HTML reference (2026-01-17):

| Format | Method | Parity | Notes |
|--------|--------|--------|-------|
| Markdown | Direct parse | 100% | Perfect structural match |
| DOCX | Native XML (python-docx) | 100% | Perfect structural match |
| HTML | BeautifulSoup | Reference | Baseline for comparison |
| XML | defusedxml | 90% | Structure preserved, markdown differs |
| PDF | 14-stage pipeline | 87% | Varies by document complexity |
| RST | docutils | 85% | Section structure varies |
| EPUB | ebooklib | 82% | Chapter structure varies |
| PPTX | python-pptx | 81% | Slide-based structure |
| XLSX | openpyxl | 16% | Expected (spreadsheet format) |
| Images | OCR/VLM | 16% | Requires VLM for text extraction |

Pipeline Stages

The full pipeline runs 14+ stages. The listing below covers the main ones; stages numbered 11 to 13 are not listed here.

00_profile_detector     Detect document type, select preset
01_annotation_processor Strip PDF annotations
02_marker_extractor     Extract blocks (text, tables, figures)
03_suspicious_headers   Verify header classifications with VLM
04_section_builder      Build document sections
05_table_extractor      Extract and describe tables
06_figure_extractor     Extract and describe figures
07_duckdb_ingest        Assemble into queryable DB
08_extract_requirements Mine requirements (if detected)
08b_lean4_theorem_prover Formal proofs (scientific only)
09_section_summarizer   Generate section summaries
10_markdown_exporter    Export to Markdown
14_report_generator     Generate extraction report

Output Structure

{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": {
    "sections": 12,
    "tables": 5,
    "figures": 8
  }
}
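
A caller, such as an agent, would typically parse this result before opening any of the files. The sketch below assumes the JSON is available as a string (whether it arrives on stdout or in a file is not stated here). `summarize_result` is a hypothetical helper, not part of the skill.

```python
import json
from pathlib import Path


def summarize_result(result_json: str, base_dir: str = ".") -> dict:
    """Parse the result JSON and resolve output paths against base_dir."""
    result = json.loads(result_json)
    if not result.get("success"):
        raise RuntimeError("extraction reported failure")
    # Output paths in the sample are relative, so anchor them to base_dir.
    outputs = {
        name: Path(base_dir) / rel
        for name, rel in result.get("outputs", {}).items()
    }
    return {
        "preset": result.get("preset"),
        "counts": result.get("counts", {}),
        "outputs": outputs,
    }
```

The returned `outputs` values are `pathlib.Path` objects, so a caller can check `outputs["markdown"].exists()` before reading the exported document.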

Batch Processing

# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results

# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"

# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive

# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results

Agent-Friendly Flags

| Flag | Purpose |
|------|---------|
| --profile-only | Return profile JSON without extraction |
| --no-interactive | Skip prompts, use auto mode |
| --preset <name> | Force preset (skip detection) |
| --fast | No LLM, quick extraction |
| --toc-check | Check TOC integrity against extracted sections |
| --auto-ocr | OCR scanned PDFs with OCRmyPDF (lazy-loads docker image) |
| --no-auto-ocr | Disable OCRmyPDF preprocessing for scanned PDFs |
| --skip-scanned | Skip scanned PDFs and write a skip manifest |
| --ocr-lang <langs> | OCR language(s), e.g. eng or eng+deu |
| --ocr-deskew | Deskew scanned pages during OCR |
| --ocr-force | Force OCR even if text exists |
| --ocr-timeout <sec> | OCR timeout in seconds |
| --continue-on-error | Continue pipeline on step failures (batch-friendly) |
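
For scripted use, these flags can be assembled into an argument list rather than a shell string. The helper below is a hypothetical sketch. It uses only flags listed above plus `--out` from the Quick Start, and it assumes (without confirmation from the skill itself) that these flags can be freely combined.

```python
from typing import List, Optional

RUN_SH = ".pi/skills/extractor/run.sh"  # path used in this document's examples


def build_command(
    target: str,
    preset: Optional[str] = None,
    out: Optional[str] = None,
    no_interactive: bool = True,
    continue_on_error: bool = False,
    skip_scanned: bool = False,
) -> List[str]:
    """Assemble an argv list for run.sh (hypothetical helper)."""
    cmd = [RUN_SH, target]
    if preset:
        cmd += ["--preset", preset]  # force preset, skip detection
    if out:
        cmd += ["--out", out]
    if no_interactive:
        cmd.append("--no-interactive")  # never block on a prompt
    if continue_on_error:
        cmd.append("--continue-on-error")
    if skip_scanned:
        cmd.append("--skip-scanned")
    return cmd
```

The list can then be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.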

TOC Integrity Check

Verify that extracted sections match the PDF's Table of Contents (bookmarks):

# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check

# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check

Output:

{
  "success": true,
  "has_toc": true,
  "integrity_score": 0.85,
  "status": "GOOD",
  "toc_entries_count": 20,
  "sections_count": 18,
  "matched_count": 17,
  "missing_count": 3,
  "matched": [
    { "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
  ],
  "missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}

Status levels:

  • EXCELLENT: >= 90% match
  • GOOD: >= 70% match
  • FAIR: >= 50% match
  • POOR: < 50% match
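
The status thresholds map directly onto a small function. The sketch below also assumes that the integrity score is the ratio of matched entries to TOC entries. That formula is not stated here, but it is consistent with the sample output (17 matched of 20 entries gives 0.85). Both function names are hypothetical.

```python
def integrity_score(matched_count: int, toc_entries_count: int) -> float:
    """Assumed formula: fraction of TOC entries matched to a section."""
    if toc_entries_count == 0:
        return 0.0  # no TOC to compare against
    return matched_count / toc_entries_count


def integrity_status(score: float) -> str:
    """Map a score in [0, 1] to the status levels listed above."""
    if score >= 0.90:
        return "EXCELLENT"
    if score >= 0.70:
        return "GOOD"
    if score >= 0.50:
        return "FAIR"
    return "POOR"
```

For the sample output, `integrity_status(integrity_score(17, 20))` returns `"GOOD"`.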

Environment

Requires the extractor project with its virtual environment:

  • Project: /home/graham/workspace/experiments/extractor
  • Venv: .venv/bin/python
  • Dependencies: scillm, fetcher (local paths)

Set EXTRACTOR_ROOT to override the project location.
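
The override can be pictured as a simple environment lookup with the project path as the default. This is a sketch of that idea only. `resolve_extractor` is a hypothetical name, and how run.sh actually resolves the location is not shown here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple

DEFAULT_ROOT = "/home/graham/workspace/experiments/extractor"  # project path above


def resolve_extractor(env: Optional[Mapping[str, str]] = None) -> Tuple[Path, Path]:
    """Return (project_root, venv_python), honoring EXTRACTOR_ROOT if set."""
    env = os.environ if env is None else env
    root = Path(env.get("EXTRACTOR_ROOT") or DEFAULT_ROOT)
    # The venv interpreter is given above as .venv/bin/python, relative to the project.
    return root, root / ".venv" / "bin" / "python"
```

Passing an explicit mapping instead of relying on `os.environ` keeps the lookup easy to test.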

Sanity Check

# Verify skill works across all formats
.pi/skills/extractor/sanity.sh

Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG

LLM Requirements

For accurate mode (VLM/table descriptions):

  • CHUTES_API_BASE - Chutes API endpoint
  • CHUTES_API_KEY - API key
  • CHUTES_VLM_MODEL - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
  • CHUTES_TEXT_MODEL - Text model (default: moonshotai/Kimi-K2-Instruct-0905)

For Lean4 proving (arxiv preset):

  • lean_runner container running
  • OPENROUTER_API_KEY set
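
A preflight check can catch missing credentials before a long accurate-mode run. The sketch below treats the two model variables as optional, since defaults are listed for them. The grouping constants and `missing_env` are hypothetical, and the check does not verify that the lean_runner container is running.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variables without a documented default are treated as required.
ACCURATE_REQUIRED = ("CHUTES_API_BASE", "CHUTES_API_KEY")
LEAN4_REQUIRED = ("OPENROUTER_API_KEY",)


def missing_env(
    required: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling `missing_env(ACCURATE_REQUIRED)` before passing `--accurate` gives a clear list of what to export, instead of a failure partway through the pipeline.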
