npx skills add TheSimpleApp/agent-skills --skill "pdf"
Install a specific skill from a multi-skill repository
# Description
Extract text and tables from PDFs, fill forms, merge documents, and create new PDFs. Use when working with PDF files or when users mention PDFs, forms, or document extraction.
# SKILL.md
---
name: pdf
description: Extract text and tables from PDFs, fill forms, merge documents, and create new PDFs. Use when working with PDF files or when users mention PDFs, forms, or document extraction.
license: MIT
metadata:
  author: thesimpleapp
  version: "1.0"
---
# PDF Processing
Work with PDF documents: extract content, fill forms, merge, and create.
## Capabilities
| Task | Approach |
|---|---|
| Extract text | Parse PDF structure or use OCR for scanned docs |
| Extract tables | Identify table structures and convert to structured data |
| Fill forms | Locate form fields and populate values |
| Merge PDFs | Combine multiple documents |
| Create PDFs | Generate from HTML/Markdown or programmatically |
## Python Libraries
### pypdf (formerly PyPDF2) - Basic Operations

    from pypdf import PdfReader, PdfWriter

    # Read PDF: print the text of every page
    reader = PdfReader("document.pdf")
    for page in reader.pages:
        text = page.extract_text()
        print(text)

    # Merge PDFs: copy every page of each input into one writer
    writer = PdfWriter()
    for pdf in ["file1.pdf", "file2.pdf"]:
        reader = PdfReader(pdf)
        for page in reader.pages:
            writer.add_page(page)
    writer.write("merged.pdf")

### pdfplumber - Table Extraction

    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        for page in pdf.pages:
            # Extract tables: each table is a list of rows, each row a list of cells
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    print(row)

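`extract_tables()` returns each table as a list of rows, and each row as a list of cell strings (or `None` for empty cells). A minimal sketch of turning one such table into records, assuming the first row is a header row; `table_to_records` is a hypothetical helper, not part of pdfplumber:

```python
from typing import Dict, List, Optional


def table_to_records(table: List[List[Optional[str]]]) -> List[Dict[str, str]]:
    """Convert a pdfplumber-style table (list of rows) into a list of dicts."""
    if not table:
        return []
    # Assumption: the first row holds the column names.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells arrive as None; normalize them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows with no content at all
        records.append(dict(zip(header, cells)))
    return records


# Hand-written sample in the same shape as one extracted table.
sample = [["Name", "Qty"], ["Widget", "3"], [None, None], ["Gadget", None]]
print(table_to_records(sample))
```

Real tables often have merged cells or repeated header rows, so this header assumption may need adjusting per document.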
### reportlab - Create PDFs

    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("output.pdf", pagesize=letter)
    c.drawString(100, 750, "Hello, World!")  # x, y in points from the bottom-left corner
    c.save()

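reportlab's canvas measures in points from the bottom-left corner of the page, so `y=750` sits near the top of a letter page (612 x 792 points). A sketch of computing positions for several lines of text under that coordinate system; `line_positions` and its margin and leading defaults are illustrative assumptions, not reportlab API:

```python
from typing import List, Tuple

# Width and height in points; the same values as reportlab.lib.pagesizes.letter.
LETTER = (612.0, 792.0)


def line_positions(
    n_lines: int,
    page_size: Tuple[float, float] = LETTER,
    margin: float = 72.0,   # assumed 1-inch margin
    leading: float = 14.0,  # assumed distance between baselines
) -> List[Tuple[float, float]]:
    """Return an (x, y) baseline position per line, top to bottom."""
    _, height = page_size
    positions = []
    y = height - margin
    for _ in range(n_lines):
        if y < margin:
            break  # no room left above the bottom margin
        positions.append((margin, y))
        y -= leading
    return positions


print(line_positions(3))
```

Each `(x, y)` pair can be passed to `c.drawString(x, y, line)`; lines that do not fit are dropped here, so a caller would start a new page for the remainder.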
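## JavaScript Libraries
### pdf-lib - Modify PDFs

    import { readFile } from 'node:fs/promises';
    import { PDFDocument } from 'pdf-lib';

    // Load existing PDF ('document.pdf' is an example path; top-level await needs an ES module)
    const existingPdfBytes = await readFile('document.pdf');
    const pdfDoc = await PDFDocument.load(existingPdfBytes);

    // Add a page
    const page = pdfDoc.addPage();
    page.drawText('Hello, World!', { x: 50, y: 700 });

    // Save: returns the PDF as bytes; writing them to disk is a separate step
    const pdfBytes = await pdfDoc.save();
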
### pdf-parse - Extract Text

    const fs = require('fs');
    const pdf = require('pdf-parse');

    const dataBuffer = fs.readFileSync('document.pdf');
    // CommonJS has no top-level await, so use the returned promise
    pdf(dataBuffer).then((data) => {
      console.log(data.text);
    });

## Form Filling

    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("form.pdf")
    writer = PdfWriter()
    writer.append(reader)  # copy the pages and the form definition into the writer

    # Fill form fields on the first page (keys must match the PDF's field names)
    writer.update_page_form_field_values(
        writer.pages[0],
        {"name": "John Doe", "email": "[email protected]"},
    )
    writer.write("filled_form.pdf")

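Field names differ from form to form, so it can help to compare the values you intend to write against the fields the PDF actually defines. pypdf exposes these through `PdfReader.get_fields()`, which returns a dict keyed by field name, or `None` when the document has no form. The helper below, `split_known_fields`, is a hypothetical sketch that works on plain names so it can be tried without a PDF:

```python
from typing import Dict, Iterable, List, Tuple


def split_known_fields(
    available: Iterable[str], values: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Split requested values into fillable ones and unknown field names."""
    available_set = set(available)
    known = {name: value for name, value in values.items() if name in available_set}
    unknown = sorted(name for name in values if name not in available_set)
    return known, unknown


# With pypdf, the available names could come from:
#   available = (PdfReader("form.pdf").get_fields() or {}).keys()
available = ["name", "email", "phone"]  # assumed field names for illustration
known, unknown = split_known_fields(available, {"name": "John Doe", "fax": "n/a"})
print(known)
print(unknown)
```

Passing only `known` to `update_page_form_field_values` and reporting `unknown` makes a misspelled field name visible instead of silently ignored.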
## OCR for Scanned PDFs

    import pytesseract
    from pdf2image import convert_from_path

    # Convert PDF to images (pdf2image requires Poppler to be installed)
    images = convert_from_path("scanned.pdf")

    # OCR each page (pytesseract requires the Tesseract binary)
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image)
        print(f"Page {i+1}:\n{text}")

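## Best Practices
- Check if scanned - If text extraction returns empty or near-empty text, fall back to OCR
- Handle encoding - PDFs can embed fonts with custom encodings, so extracted text may contain wrong or missing characters
- Preserve formatting - Tables and multi-column layouts may not extract cleanly; expect to clean up the result
- Large files - Process page-by-page to keep memory use bounded
- Validate output - Spot-check extracted data against the source PDF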
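The first and fourth practices can be combined into one page-by-page loop that falls back to OCR only for pages with little or no extractable text. The sketch below is an illustration, not a fixed recipe: `needs_ocr`, `extract_pages`, and the 20-character threshold are assumptions, and the OCR step is passed in as a callable so the logic can be tried without any PDF libraries.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def needs_ocr(text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned if it yields almost no text."""
    return len((text or "").strip()) < min_chars


def extract_pages(
    page_texts: Iterable[Optional[str]],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> Iterator[Tuple[int, str, str]]:
    """Yield (page_index, source, text) one page at a time."""
    for index, text in enumerate(page_texts):
        if needs_ocr(text, min_chars):
            # Only pages without usable text pay the cost of OCR.
            yield index, "ocr", ocr_page(index)
        else:
            yield index, "text", text


# Stand-in inputs: one page with text, one empty page, one page with no text layer.
demo = ["A" * 30, "", None]
for index, source, text in extract_pages(demo, lambda i: f"<ocr of page {i + 1}>"):
    print(index, source, text)
```

With pypdf, `page_texts` could be `(page.extract_text() for page in reader.pages)`, and `ocr_page` could render a single page with pdf2image's `first_page` and `last_page` arguments before calling pytesseract. Because both are consumed lazily, only one page needs to be held in memory at a time.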
# Supported AI Coding Agents
This skill follows the SKILL.md standard and works with AI coding agents that support it.