# jmservera/transcribe-pdf
# Install this skill

    npx skills add jmservera/agent-skills --skill "transcribe-pdf"

This installs a specific skill from a multi-skill repository.

# Description

Render a scanned PDF to per-page images, auto-crop UI/header bars, then transcribe page-by-page using LLM vision (no OCR) into a single Markdown file.

# SKILL.md


---
name: transcribe-pdf
description: Render a scanned PDF to per-page images, auto-crop UI/header bars, then transcribe page-by-page using LLM vision (no OCR) into a single Markdown file.
---


# Transcribe a scanned PDF to Markdown (vision-only)

## Make it autonomous (reduce prompts)

This skill writes images and a running `transcription.md`. If your Copilot CLI asks for approval on each file write, have the user run `/allow-all` (or `/yolo`) once at the start of the session so you can proceed without repeated interruptions.

## Hard rules (must follow)

- **NO OCR:** do not run OCR tools or OCR libraries. Transcription must be done using LLM vision.
- **Whole document, one page at a time:** render/prepare images for the whole PDF once, but transcribe sequentially, page by page (never batch multiple pages into one vision prompt).
- **Don't overdo it:** transcribe what's legible in the page image; use placeholders for uncertain or unreadable parts.
- **No zoom / no slicing unless necessary:** prefer the already-cropped full-page image.
- **Do not stop to ask:** after starting, continue through all pages, appending as you go.

## Why cropping matters (key finding)

Many archival viewers add header/footer UI bars (e.g., “PARES …”). These reduce effective resolution and can confuse transcription.
Always crop those bars out before transcribing.

## Workflow

### 0) Install dependencies (once)

    pip install -r "<skill_root>\scripts\requirements.txt"

### 1) (Optional) Confirm there is no usable text layer

This is a quick sanity check to avoid wasting time trying to extract text that isn’t there.

python "<skill_root>\scripts\check_text_layer_sample.py" "<pdf_path>"

If the samples show ~0 characters or only metadata, proceed with image transcription.
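
For reference, this kind of check can be approximated with PyMuPDF. The following is a minimal sketch, assuming the bundled `check_text_layer_sample.py` works along these lines; its actual sampling logic may differ:

```python
# Minimal sketch of a non-OCR text-layer check using PyMuPDF.
# Assumption: the bundled check_text_layer_sample.py does something
# similar; this is illustrative, not the script's actual code.
import sys
import fitz  # PyMuPDF

def sample_text_layer(pdf_path: str, samples: int = 5) -> None:
    doc = fitz.open(pdf_path)
    step = max(1, len(doc) // samples)
    for i in range(0, len(doc), step):
        text = doc[i].get_text().strip()  # embedded text layer only, no OCR
        print(f"page {i + 1}: {len(text)} characters")
    doc.close()

if __name__ == "__main__":
    sample_text_layer(sys.argv[1])
```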

### 2) Prepare images for the whole PDF (single command)

If the workspace already contains correctly cropped `page-*.png` images, skip to Step 3.

Recommended: render + crop in one run.

python "<skill_root>\scripts\render_and_crop_pdf.py" "<pdf_path>" "<out_dir>" --zoom 2.0

Outputs:
- "<out_dir>\cropped\page-01.png", page-02.png, … (canonical inputs for transcription)

Crop defaults are tuned to remove common header/footer bars at --zoom 2.0:
- --top 280 --bottom 140 (adjust if needed)
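
The script is the supported path, but the underlying operation is simple. Below is a sketch of the render-then-crop idea, assuming PyMuPDF for rendering and Pillow for cropping; the real script's internals may differ:

```python
# Sketch of render + crop, assuming PyMuPDF and Pillow. Crop values
# mirror the documented defaults for --zoom 2.0; this is not the
# bundled script's actual code.
import io
from pathlib import Path

import fitz  # PyMuPDF
from PIL import Image

def render_and_crop(pdf_path: str, out_dir: str,
                    zoom: float = 2.0, top: int = 280, bottom: int = 140) -> None:
    cropped = Path(out_dir) / "cropped"
    cropped.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    matrix = fitz.Matrix(zoom, zoom)  # zoom 2.0 doubles the render resolution
    for n, page in enumerate(doc, start=1):
        pix = page.get_pixmap(matrix=matrix)
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        w, h = img.size
        # Cut the viewer's header/footer bars off by pixel count.
        img.crop((0, top, w, h - bottom)).save(cropped / f"page-{n:02d}.png")
    doc.close()
```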

### 3) Transcribe the whole document, one page at a time (LLM vision)

Create the transcript file if it doesn’t exist, then append each page immediately (never hold results in chat).

- Transcript path: `<out_dir>\transcription.md`
- For each page image in `<out_dir>\cropped` (a sketch of this loop follows the list):
  1. Open/view exactly one image (e.g., `page-21.png`).
  2. Transcribe what you can see into Markdown.
  3. Append immediately to the transcript under a heading `## Page N`.
  4. Move to the next page and repeat until the final page.
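
The loop itself is performed by the agent, not by a script, but the append-as-you-go mechanics look like this. `transcribe_with_vision` is a hypothetical stand-in for the agent viewing one page; no such function ships with this skill:

```python
# Illustration of the append-as-you-go discipline only.
from pathlib import Path

def transcribe_with_vision(image_path: Path) -> str:
    """Hypothetical stand-in: the real step is the agent viewing one page."""
    raise NotImplementedError("performed by the agent's vision, not by code")

out_dir = Path("<out_dir>")  # placeholder path, as in the commands above
transcript = out_dir / "transcription.md"
transcript.touch(exist_ok=True)  # create the transcript file if missing

for n, image in enumerate(sorted((out_dir / "cropped").glob("page-*.png")), start=1):
    markdown = transcribe_with_vision(image)  # exactly one page per vision call
    with transcript.open("a", encoding="utf-8") as f:
        f.write(f"\n## Page {n}\n\n{markdown}\n")  # append immediately, never buffer
```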

## Transcription conventions

- Preserve original spelling/orthography when possible (old Spanish is expected).
- Use placeholders instead of guessing:
  - `[ilegible]` for unreadable words/lines
  - `[¿…?]` for uncertain readings
- If a page is blank or non-text: write `[Página en blanco]` or a short note.
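
Purely as an illustration of the format (the page content below is a placeholder, not from any real document), appended pages look like:

```markdown
## Page 21

...transcribed text for this page, with [ilegible] and [¿…?]
placeholders wherever the image cannot be read with confidence...

## Page 22

[Página en blanco]
```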

## Helper scripts included

- `render_pdf_to_images.py`: render PDF pages to PNGs.
- `render_and_crop_pdf.py`: render + crop in one command (preferred).
- `crop_images.py`: crop an existing set of `page-*.png` images.
- `check_text_layer_sample.py`: sample the embedded PDF text layer (non-OCR).

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
