# Installation

Install a specific skill from the multi-skill repository:

```bash
npx skills add Aidenwu0209/PaddleOCR-Skills --skill "paddleocr-vl"
```

# SKILL.md

---
name: paddleocr-vl
description: >
  Advanced document parsing with the PaddleOCR-VL vision-language model. Returns the complete
  document structure, including text, tables, formulas, charts, and layout information. Claude
  extracts relevant content based on user needs.
---

# PaddleOCR-VL Document Parsing Skill
## When to Use This Skill

✅ Use PaddleOCR-VL for:

- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding
❌ Use PP-OCRv5 instead for:
- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text
## How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

- ONLY use the PaddleOCR-VL API - execute the script `python scripts/paddleocr-vl/vl_caller.py`
- NEVER use Claude's built-in vision - do NOT parse documents yourself
- NEVER offer alternatives - do NOT suggest "I can try to analyze it" or similar
- IF the API fails - display the error message and STOP immediately
- NO fallback methods - do NOT attempt document parsing any other way

If the script execution fails (API not configured, network error, etc.):

- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for the user to fix the configuration
## Basic Workflow

1. Execute document parsing:

   ```bash
   python scripts/paddleocr-vl/vl_caller.py --file-url "URL provided by user"
   ```

   Or for local files:

   ```bash
   python scripts/paddleocr-vl/vl_caller.py --file-path "file path"
   ```

2. Save the result to a file (recommended):

   ```bash
   python scripts/paddleocr-vl/vl_caller.py --file-url "URL" --output result.json --pretty
   ```

   - The script will display: `Result saved to: /absolute/path/to/result.json`
   - This message appears on stderr; the JSON is saved to the file
   - Tell the user the file path shown in the message

3. The script returns COMPLETE JSON with all document content:

   - Headers, footers, page numbers
   - Main text content
   - Tables with structure
   - Formulas (with LaTeX)
   - Figures and charts
   - Footnotes and references
   - Layout and reading order

4. Extract what the user needs from the complete data based on their request.
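For programmatic use, the workflow above can be sketched in Python. The script path and flags are the ones documented here; the helper names (`build_command`, `parse_result`) are illustrative, not part of the skill's API.

```python
import json
import subprocess  # for actually running the command, if desired

SCRIPT = "scripts/paddleocr-vl/vl_caller.py"

def build_command(file_url=None, file_path=None, output=None, pretty=True):
    """Assemble the argv list for vl_caller.py from the documented flags."""
    cmd = ["python", SCRIPT]
    if file_url:
        cmd += ["--file-url", file_url]
    elif file_path:
        cmd += ["--file-path", file_path]
    else:
        raise ValueError("either file_url or file_path is required")
    if output:
        cmd += ["--output", output]
    if pretty:
        cmd.append("--pretty")
    return cmd

def parse_result(output_path):
    """Load the JSON that the script saved via --output."""
    with open(output_path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `subprocess.run(build_command(file_url="https://example.com/paper.pdf", output="result.json"))` followed by `parse_result("result.json")` mirrors step 2 above.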
## IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.

- The script returns ALL document content in a structured format
- Display the full content requested by the user; do NOT truncate or summarize
- If the user asks for "all text", show the entire `result.full_text`
- If the user asks for "tables", show ALL tables in the document
- If the user asks for "main content", filter out headers/footers but show ALL body text
What this means:
- ✅ DO: Display complete text, all tables, all formulas as requested
- ✅ DO: Present content in reading order using reading_order array
- ❌ DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- ❌ DON'T: Summarize or provide excerpts when user asks for full content
- ❌ DON'T: Say "Here's a preview" when user expects complete output
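The truncation rule above can be enforced with a tiny guard. This is a sketch; only the 10,000-character figure comes from the rule, and the function name is illustrative.

```python
def display_text(content, limit=10_000):
    """Return content unchanged unless it is excessively long (> limit chars)."""
    if len(content) <= limit:
        return content
    # Truncate only in the excessive case, and say where the full text lives
    return content[:limit] + "\n... [truncated; full text is in the saved result file]"
```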
Example - Correct ✅:
User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:
[Display entire result.full_text or concatenated regions in reading order]
Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)
Example - Incorrect ❌:
User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"
## Understanding the JSON Response

The script returns a complete JSON structure:

```json
{
  "ok": true,
  "result": {
    "full_text": "Complete text with all content including headers, footers, etc.",
    "layout": {
      "regions": [
        {
          "id": 0,
          "type": "header",
          "content": "Chapter 3: Methods",
          "bbox": [100, 50, 500, 100]
        },
        {
          "id": 1,
          "type": "text",
          "content": "Main body text content...",
          "bbox": [100, 150, 500, 300]
        },
        {
          "id": 2,
          "type": "table",
          "content": {
            "rows": 2,
            "cols": 2,
            "cells": [["Header1", "Header2"], ["Data1", "Data2"]]
          },
          "bbox": [100, 350, 500, 550]
        },
        {
          "id": 3,
          "type": "formula",
          "content": "E = mc^2",
          "latex": "$E = mc^2$",
          "bbox": [200, 600, 400, 630]
        },
        {
          "id": 4,
          "type": "footnote",
          "content": "[1] Reference citation",
          "bbox": [100, 650, 500, 680]
        },
        {
          "id": 5,
          "type": "footer",
          "content": "University Name 2024",
          "bbox": [100, 750, 500, 800]
        },
        {
          "id": 6,
          "type": "page_number",
          "content": "25",
          "bbox": [250, 770, 280, 790]
        }
      ],
      "reading_order": [0, 1, 2, 3, 4, 5, 6]
    }
  },
  "metadata": {
    "processing_time_ms": 3500,
    "total_pages": 1,
    "model_version": "paddleocr-vl-0.9b"
  }
}
```
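Before presenting anything, it can help to sanity-check the response shape above. A minimal sketch; the miniature `resp` below mirrors the documented structure, and `summarize_regions` is an illustrative helper, not part of the skill.

```python
from collections import Counter

def summarize_regions(resp):
    """Check the ok flag and tally region types from a parsed response."""
    if not resp.get("ok"):
        raise RuntimeError("parsing failed; show the script's error to the user")
    regions = resp["result"]["layout"]["regions"]
    return Counter(r["type"] for r in regions)

# A miniature response in the documented shape
resp = {
    "ok": True,
    "result": {
        "layout": {
            "regions": [
                {"id": 0, "type": "header", "content": "Chapter 3: Methods"},
                {"id": 1, "type": "text", "content": "Main body text..."},
                {"id": 2, "type": "table", "content": {"rows": 1, "cols": 1, "cells": [["x"]]}},
            ],
            "reading_order": [0, 1, 2],
        }
    },
}
print(summarize_regions(resp))
```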
## Region Types

The `type` field indicates the element category:

| Type | Description | Typically Include? |
|---|---|---|
| `header` | Page headers (chapter/section titles) | Exclude (usually repetitive) |
| `text` | Main body text | Include |
| `table` | Tables with structured data | Include |
| `formula` | Mathematical formulas | Include |
| `figure` | Images, charts, diagrams | Include |
| `footnote` | Footnotes and references | Include (often important) |
| `footer` | Page footers (author/institution) | Exclude (usually repetitive) |
| `page_number` | Page numbers | Exclude (not content) |
| `margin_note` | Margin annotations | Context-dependent |
## Content Extraction Guidelines

Based on user intent, filter the regions:
| User Says | What to Extract | How |
|---|---|---|
| "Extract main content" | text, table, formula, figure, footnote | Skip header, footer, page_number |
| "Get all tables" | table only | Filter by type="table" |
| "Extract formulas" | formula only | Filter by type="formula" |
| "Complete document" | Everything | Use all regions or full_text |
| "Without headers/footers" | Core content | Skip header, footer types |
| "Include page numbers" | Core + page_number | Keep page_number type |
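The table above boils down to choosing a set of region types. A sketch of one such filtering helper; the function and constant names are illustrative.

```python
def filter_regions(resp, include_types):
    """Return regions whose type is in include_types, in reading order."""
    layout = resp["result"]["layout"]
    by_id = {r["id"]: r for r in layout["regions"]}
    return [by_id[i] for i in layout["reading_order"]
            if i in by_id and by_id[i]["type"] in include_types]

# Type sets matching rows of the table above
MAIN_CONTENT = {"text", "table", "formula", "figure", "footnote"}
TABLES_ONLY = {"table"}
```

For example, `filter_regions(resp, MAIN_CONTENT)` implements "Extract main content", and `filter_regions(resp, TABLES_ONLY)` implements "Get all tables".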
## Usage Examples

Example 1: Extract Main Content (default behavior)

```bash
python scripts/paddleocr-vl/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty
```

Then filter the JSON to extract core content:

- Include: text, table, formula, figure, footnote
- Exclude: header, footer, page_number

Example 2: Extract Tables Only

```bash
python scripts/paddleocr-vl/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty
```

Then filter the JSON:

- Only keep regions where type="table"
- Present table content in markdown format

Example 3: Complete Document with Everything

```bash
python scripts/paddleocr-vl/vl_caller.py \
  --file-url "URL" \
  --pretty
```

Then use `result.full_text` or present all regions in `reading_order`.
## First-Time Configuration

When the API is not configured, the error will show:

```
Configuration error: API not configured. Get your API at: https://aistudio.baidu.com/paddleocr/task
```

Auto-configuration workflow:

1. Show the exact error message to the user (including the URL)

2. Tell the user to provide credentials:

   > Please visit the URL above to get your VL_API_URL and VL_TOKEN. Once you have them, send them to me and I'll configure it automatically.

3. When the user provides credentials, accept any format:

   - `VL_API_URL=https://xxx.com/v1, VL_TOKEN=abc123...`
   - "Here's my API: https://xxx and token: abc123"
   - Copy-pasted code format
   - Any other reasonable format

4. Parse the credentials from the user's message:

   - Extract the VL_API_URL value (look for URLs)
   - Extract the VL_TOKEN value (a long alphanumeric string, usually 40+ chars)

5. Configure automatically:

   ```bash
   python scripts/paddleocr-vl/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"
   ```

6. If configuration succeeds:

   - Inform the user: "Configuration complete! Parsing document now..."
   - Retry the original parsing task

7. If configuration fails:

   - Show the error
   - Ask the user to verify the credentials

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.
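The credential-parsing step above can be approximated with two regular expressions. These are heuristics only, not a guaranteed parser, and the 20-character token floor is an assumption looser than the "usually 40+ chars" noted above.

```python
import re

def parse_credentials(message):
    """Best-effort extraction of VL_API_URL and VL_TOKEN from free text."""
    # A URL is a heuristic match: http(s) up to whitespace or a comma
    url_m = re.search(r"https?://[^\s,]+", message)
    url = url_m.group(0) if url_m else None
    # Search for the token outside the URL so path segments don't match
    rest = message.replace(url, " ") if url else message
    token_m = re.search(r"\b[A-Za-z0-9_\-]{20,}\b", rest)
    token = token_m.group(0) if token_m else None
    return url, token
```

If either value comes back `None`, ask the user to resend the credentials rather than guessing.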
## Handling Large Files (>20MB)

Problem: Local files larger than 20MB are rejected by default.

Solutions (choose based on your situation):

### Solution 1: Use URL Upload (Recommended) ⭐

Upload your file to a web server and use `--file-url`:

```bash
# Instead of a local file
python scripts/paddleocr-vl/vl_caller.py --file-path "large_file.pdf"  # ❌ May fail

# Use a URL instead
python scripts/paddleocr-vl/vl_caller.py --file-url "https://your-server.com/large_file.pdf"  # ✅ No size limit
```

Benefits:

- No local file size limit
- Faster upload (direct from the API server)
- Suitable for very large files (>100MB)
### Solution 2: Increase the Size Limit

Adjust the limit in the `.env` file:

```bash
# .env
VL_MAX_FILE_SIZE_MB=50  # Increase from 20MB to 50MB
```

Then process as usual:

```bash
python scripts/paddleocr-vl/vl_caller.py --file-path "large_file.pdf"
```

Note: Your API provider may still have upload limits. Check with your VL API service.
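The size check described above can be mirrored locally before calling the script. A sketch; only the `VL_MAX_FILE_SIZE_MB` variable name and the 20MB default come from this document, and the helper itself is illustrative.

```python
import os

def within_size_limit(path, default_mb=20):
    """Compare a file's size against the configured VL_MAX_FILE_SIZE_MB."""
    limit_mb = float(os.environ.get("VL_MAX_FILE_SIZE_MB", default_mb))
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb <= limit_mb
```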
### Solution 3: Compress/Optimize the File

Use the built-in optimizer to reduce file size:

```bash
# Install optimization dependencies first
pip install -r scripts/paddleocr-vl/requirements-optimize.txt

# Optimize an image (reduce quality)
python scripts/paddleocr-vl/optimize_file.py input.png output.png --quality 70

# Optimize a PDF (compress the images within)
python scripts/paddleocr-vl/optimize_file.py input.pdf output.pdf --target-size 15

# Then process the optimized file
python scripts/paddleocr-vl/vl_caller.py --file-path "output.pdf" --pretty
```

The optimizer will:

- Compress images (adjusting JPEG quality)
- Resize images if needed (maintaining aspect ratio)
- Compress PDF images (reducing DPI to 150)
- Show a before/after size comparison
### Solution 4: Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

```bash
# Using PyMuPDF (requires: pip install PyMuPDF)
python -c "
import fitz
doc = fitz.open('large.pdf')
writer = fitz.open()
writer.insert_pdf(doc, from_page=0, to_page=4)  # Pages 1-5
writer.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/paddleocr-vl/vl_caller.py --file-path "pages_1_5.pdf"
```
### Solution 5: Upload to Cloud Storage

For extremely large files (>100MB), use cloud storage with public URLs:

```bash
# Example: upload to AWS S3, Google Drive, or similar
# Get a public URL: https://storage.example.com/my-document.pdf

# Process via the URL
python scripts/paddleocr-vl/vl_caller.py --file-url "https://storage.example.com/my-document.pdf"
```

Comparison table:
| Solution | Max Size | Speed | Complexity | Best For |
|---|---|---|---|---|
| URL Upload | Unlimited | Fast | Low | Any large file |
| Increase Limit | Configurable | Medium | Very Low | Slightly over limit |
| Compress | ~70% reduction | Slow | Medium | Images/PDFs with images |
| Extract Pages | As needed | Fast | Medium | Multi-page PDFs |
| Cloud Storage | Unlimited | Fast | High | Very large files (>100MB) |
## Error Handling

Authentication failed (401/403):

```
error: Authentication failed
```

→ The token is invalid; reconfigure with correct credentials.

API quota exceeded (429):

```
error: API quota exceeded
```

→ The daily API quota is exhausted; inform the user to wait or upgrade.

Unsupported format:

```
error: Unsupported file format
```

→ The file format is not supported; convert to PDF/PNG/JPG.
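The three cases above can be dispatched with a simple lookup. A sketch; the exact failure-response shape is an assumption here (an `ok: false` payload carrying an `error` string), and the guidance texts paraphrase the arrows above.

```python
# Maps the documented error strings to the recommended guidance
GUIDANCE = {
    "Authentication failed": "Token is invalid - reconfigure with correct credentials.",
    "API quota exceeded": "Daily quota exhausted - wait or upgrade the plan.",
    "Unsupported file format": "Convert the file to PDF, PNG, or JPG.",
}

def explain_error(resp):
    """Return user-facing guidance for a failed response, or None on success."""
    if resp.get("ok"):
        return None
    error = resp.get("error", "")
    for key, advice in GUIDANCE.items():
        if key in error:
            return f"{error}: {advice}"
    return error or "Unknown error - show the raw script output to the user."
```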
## Pseudo-Code: Content Extraction

Extract main content (most common):

```python
def extract_main_content(json_response):
    layout = json_response['result']['layout']
    # Keep only core content types
    core_types = {'text', 'table', 'formula', 'figure', 'footnote'}
    by_id = {r['id']: r for r in layout['regions']}
    # Walk the regions in reading order, skipping headers/footers/page numbers
    for idx in layout['reading_order']:
        region = by_id.get(idx)
        if region is None or region['type'] not in core_types:
            continue
        if region['type'] == 'formula':
            print(f"Formula: {region['latex']}")
        else:
            # Tables arrive as structured dicts; render them as markdown
            # tables when presenting (see extract_tables below)
            print(region['content'])
```
Extract tables only:

```python
def extract_tables(json_response):
    regions = json_response['result']['layout']['regions']
    tables = [r for r in regions if r['type'] == 'table']
    for i, table in enumerate(tables, start=1):
        print(f"Table {i}:")
        # Render the cell grid as a markdown table
        header, *rows = table['content']['cells']
        print('| ' + ' | '.join(header) + ' |')
        print('|' + ' --- |' * len(header))
        for row in rows:
            print('| ' + ' | '.join(row) + ' |')
```
Extract everything:

```python
def extract_complete(json_response):
    # Simplest: use full_text, which includes everything
    print(json_response['result']['full_text'])

    # Or present all regions in reading order (look regions up by id,
    # since reading_order contains region ids, not list positions)
    layout = json_response['result']['layout']
    by_id = {r['id']: r for r in layout['regions']}
    for idx in layout['reading_order']:
        region = by_id[idx]
        print(f"[{region['type']}] {region['content']}")
```
## Important Notes

- The script NEVER filters content - it always returns complete data
- Claude decides what to present - based on the user's specific request
- All data is always available - it can be re-interpreted for different needs
- No information is lost - the complete document structure is preserved
## Reference Documentation

For an in-depth understanding of the PaddleOCR-VL system, refer to:

- `references/paddleocr-vl/vl_model_spec.md` - VL model specifications
- `references/paddleocr-vl/layout_schema.md` - layout detection schema
- `references/paddleocr-vl/output_format.md` - complete output format specification

Load these reference documents into context when:

- Debugging complex parsing issues
- Understanding the layout detection algorithm
- Working with special document types
- Customizing content extraction logic
## Testing the Skill

To verify the skill is working properly:

```bash
python scripts/paddleocr-vl/smoke_test.py
```

This tests the configuration and, optionally, API connectivity.
# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.