# paddleocr-vl

# Install this skill:
npx skills add Aidenwu0209/PaddleOCR-Skills --skill "paddleocr-vl"

Installs the specified skill from a multi-skill repository.


# SKILL.md


---
name: paddleocr-vl
description: >
  Advanced document parsing with PaddleOCR-VL vision-language model. Returns complete document
  structure including text, tables, formulas, charts, and layout information. Claude extracts
  relevant content based on user needs.
---


PaddleOCR-VL Document Parsing Skill

When to Use This Skill

✅ Use PaddleOCR-VL for:
- Documents with tables (invoices, financial reports, spreadsheets)
- Documents with mathematical formulas (academic papers, scientific documents)
- Documents with charts and diagrams
- Multi-column layouts (newspapers, magazines, brochures)
- Complex document structures requiring layout analysis
- Any document requiring structured understanding

❌ Use PP-OCRv5 instead for:
- Simple text-only extraction
- Quick OCR tasks where speed is critical
- Screenshots or simple images with clear text

How to Use This Skill

⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔

  1. ONLY use PaddleOCR-VL API - Execute the script python scripts/paddleocr-vl/vl_caller.py
  2. NEVER use Claude's built-in vision - Do NOT parse documents yourself
  3. NEVER offer alternatives - Do NOT suggest "I can try to analyze it" or similar
  4. IF API fails - Display the error message and STOP immediately
  5. NO fallback methods - Do NOT attempt document parsing any other way

If the script execution fails (API not configured, network error, etc.):
- Show the error message to the user
- Do NOT offer to help using your vision capabilities
- Do NOT ask "Would you like me to try parsing it?"
- Simply stop and wait for user to fix the configuration

Basic Workflow

  1. Execute document parsing:
    python scripts/paddleocr-vl/vl_caller.py --file-url "URL provided by user"
    Or, for local files:
    python scripts/paddleocr-vl/vl_caller.py --file-path "file path"

Save the result to a file (recommended):
python scripts/paddleocr-vl/vl_caller.py --file-url "URL" --output result.json --pretty
- The script will display: Result saved to: /absolute/path/to/result.json
- This message appears on stderr; the JSON is written to the file
- Tell the user the file path shown in the message

  2. The script returns COMPLETE JSON with all document content:
    - Headers, footers, page numbers
    - Main text content
    - Tables with structure
    - Formulas (with LaTeX)
    - Figures and charts
    - Footnotes and references
    - Layout and reading order

  3. Extract what the user needs from the complete data based on their request (a minimal end-to-end sketch follows this list).
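
The hedged sketch below strings these steps together: run the caller with --output, read the confirmation printed on stderr, then load the saved JSON. The flags and the top-level ok/result fields come from this document; the non-zero exit code on failure and the example URL are assumptions.

```python
# Minimal sketch of the workflow above, assuming the documented flags and JSON shape.
import json
import subprocess

cmd = [
    "python", "scripts/paddleocr-vl/vl_caller.py",
    "--file-url", "https://example.com/paper.pdf",   # or: "--file-path", "./document.pdf"
    "--output", "result.json",
    "--pretty",
]
proc = subprocess.run(cmd, capture_output=True, text=True)
print(proc.stderr.strip())   # e.g. "Result saved to: /absolute/path/to/result.json"

# Assumption: the script exits non-zero on failure; if so, surface the error and stop.
if proc.returncode != 0:
    raise SystemExit("Parsing failed - show the error above to the user and stop.")

with open("result.json", encoding="utf-8") as f:
    data = json.load(f)

if data.get("ok"):
    print(f"Parsed {len(data['result']['layout']['regions'])} regions")
```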

IMPORTANT: Complete Content Display

CRITICAL: You must display the COMPLETE extracted content to the user based on their needs.

  • The script returns ALL document content in a structured format
  • Display the full content requested by the user, do NOT truncate or summarize
  • If user asks for "all text", show the entire result.full_text
  • If user asks for "tables", show ALL tables in the document
  • If user asks for "main content", filter out headers/footers but show ALL body text

What this means:
- ✅ DO: Display complete text, all tables, all formulas as requested
- ✅ DO: Present content in reading order using reading_order array
- ❌ DON'T: Truncate with "..." unless content is excessively long (>10,000 chars)
- ❌ DON'T: Summarize or provide excerpts when user asks for full content
- ❌ DON'T: Say "Here's a preview" when user expects complete output

Example - Correct:

User: "Extract all the text from this document"
Claude: I've parsed the complete document. Here's all the extracted text:

[Display entire result.full_text or concatenated regions in reading order]

Document Statistics:
- Total regions: 25
- Text blocks: 15
- Tables: 3
- Formulas: 2
Quality: Excellent (confidence: 0.92)

Example - Incorrect ❌:

User: "Extract all the text"
Claude: "I found a document with multiple sections. Here's the beginning:
'Introduction...' (content truncated for brevity)"

Understanding the JSON Response

The script returns a complete JSON structure:

{
  "ok": true,
  "result": {
    "full_text": "Complete text with all content including headers, footers, etc.",
    "layout": {
      "regions": [
        {
          "id": 0,
          "type": "header",
          "content": "Chapter 3: Methods",
          "bbox": [100, 50, 500, 100]
        },
        {
          "id": 1,
          "type": "text",
          "content": "Main body text content...",
          "bbox": [100, 150, 500, 300]
        },
        {
          "id": 2,
          "type": "table",
          "content": {
            "rows": 3,
            "cols": 2,
            "cells": [["Header1", "Header2"], ["Data1", "Data2"]]
          },
          "bbox": [100, 350, 500, 550]
        },
        {
          "id": 3,
          "type": "formula",
          "content": "E = mc^2",
          "latex": "$E = mc^2$",
          "bbox": [200, 600, 400, 630]
        },
        {
          "id": 4,
          "type": "footnote",
          "content": "[1] Reference citation",
          "bbox": [100, 650, 500, 680]
        },
        {
          "id": 5,
          "type": "footer",
          "content": "University Name 2024",
          "bbox": [100, 750, 500, 800]
        },
        {
          "id": 6,
          "type": "page_number",
          "content": "25",
          "bbox": [250, 770, 280, 790]
        }
      ],
      "reading_order": [0, 1, 2, 3, 4, 5, 6]
    }
  },
  "metadata": {
    "processing_time_ms": 3500,
    "total_pages": 1,
    "model_version": "paddleocr-vl-0.9b"
  }
}
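
For a quick sanity check over this structure, the hedged sketch below loads a saved result.json and derives the kind of summary shown in the earlier "Example - Correct" (total regions plus a per-type count). Field names follow the JSON above; result.json is simply whatever path was passed to --output.

```python
# Summarize a saved response, assuming the JSON shape shown above.
import json
from collections import Counter

with open("result.json", encoding="utf-8") as f:
    data = json.load(f)

regions = data["result"]["layout"]["regions"]
counts = Counter(r["type"] for r in regions)

print(f"Total regions: {len(regions)}")
for region_type, n in counts.most_common():
    print(f"  {region_type}: {n}")
```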

Region Types

The type field indicates the element category:

| Type | Description | Typically Include? |
|------|-------------|--------------------|
| header | Page headers (chapter/section titles) | Exclude (usually repetitive) |
| text | Main body text | Include |
| table | Tables with structured data | Include |
| formula | Mathematical formulas | Include |
| figure | Images, charts, diagrams | Include |
| footnote | Footnotes and references | Include (often important) |
| footer | Page footers (author/institution) | Exclude (usually repetitive) |
| page_number | Page numbers | Exclude (not content) |
| margin_note | Margin annotations | Context-dependent |
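
As a concrete default, the sketch below encodes the table's Include/Exclude column as two sets and filters a region list accordingly; the set and function names are illustrative, not part of the skill's scripts.

```python
# Default include/exclude sets derived from the Region Types table above.
INCLUDE_TYPES = {"text", "table", "formula", "figure", "footnote"}
EXCLUDE_TYPES = {"header", "footer", "page_number"}   # margin_note: decide per request

def default_filter(regions):
    """Keep core content regions; drop repetitive headers/footers and page numbers."""
    return [r for r in regions if r["type"] in INCLUDE_TYPES]
```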

Content Extraction Guidelines

Based on user intent, filter the regions:

| User Says | What to Extract | How |
|-----------|-----------------|-----|
| "Extract main content" | text, table, formula, figure, footnote | Skip header, footer, page_number |
| "Get all tables" | table only | Filter by type="table" |
| "Extract formulas" | formula only | Filter by type="formula" |
| "Complete document" | Everything | Use all regions or full_text |
| "Without headers/footers" | Core content | Skip header, footer types |
| "Include page numbers" | Core + page_number | Keep page_number type |

Usage Examples

Example 1: Extract Main Content (default behavior)

python scripts/paddleocr-vl/vl_caller.py \
  --file-url "https://example.com/paper.pdf" \
  --pretty

Then filter JSON to extract core content:
- Include: text, table, formula, figure, footnote
- Exclude: header, footer, page_number

Example 2: Extract Tables Only

python scripts/paddleocr-vl/vl_caller.py \
  --file-path "./financial_report.pdf" \
  --pretty

Then filter JSON:
- Only keep regions where type="table"
- Present table content in markdown format

Example 3: Complete Document with Everything

python scripts/paddleocr-vl/vl_caller.py \
  --file-url "URL" \
  --pretty

Then use result.full_text or present all regions in reading_order.

First-Time Configuration

When API is not configured:

The error will show:

Configuration error: API not configured. Get your API at: https://aistudio.baidu.com/paddleocr/task

Auto-configuration workflow:

  1. Show the exact error message to the user (including the URL).

  2. Tell the user to provide credentials:
    Please visit the URL above to get your VL_API_URL and VL_TOKEN. Once you have them, send them to me and I'll configure it automatically.

  3. When the user provides credentials, accept any format:
    - VL_API_URL=https://xxx.com/v1, VL_TOKEN=abc123...
    - Here's my API: https://xxx and token: abc123
    - Copy-pasted code format
    - Any other reasonable format

  4. Parse the credentials from the user's message (see the sketch after this list):
    - Extract the VL_API_URL value (look for URLs)
    - Extract the VL_TOKEN value (a long alphanumeric string, usually 40+ characters)

  5. Configure automatically:
    python scripts/paddleocr-vl/configure.py --api-url "PARSED_URL" --token "PARSED_TOKEN"

  6. If configuration succeeds:
    - Inform the user: "Configuration complete! Parsing document now..."
    - Retry the original parsing task

  7. If configuration fails:
    - Show the error
    - Ask the user to verify the credentials
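
Step 4 above calls for pulling credentials out of free-form text. The sketch below is one possible approach, not part of the skill's scripts: it treats the first URL in the message as VL_API_URL and a long token-like run of characters outside that URL as VL_TOKEN. Ambiguous messages should still be confirmed with the user before running configure.py.

```python
# Hedged sketch: extract VL_API_URL and VL_TOKEN from a free-form message.
import re

def parse_credentials(message: str):
    url_match = re.search(r"https?://[^\s\"']+", message)
    api_url = url_match.group(0) if url_match else None
    # Look for the token outside the URL; tokens are "usually 40+ chars", 30 is used as a looser floor.
    remainder = message.replace(api_url, " ") if api_url else message
    token_match = re.search(r"[A-Za-z0-9_\-]{30,}", remainder)
    token = token_match.group(0) if token_match else None
    return api_url, token

api_url, token = parse_credentials(
    "Here's my API: https://xxx.com/v1 and token: abc123abc123abc123abc123abc123abc123"
)
print(api_url, token)
```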

IMPORTANT: The error message format is STRICT and must be shown exactly as provided by the script. Do not modify or paraphrase it.

Handling Large Files (>20MB)

Problem: Local files larger than 20MB are rejected by default.

Solutions (Choose based on your situation):

Solution 1: Use a URL Instead of a Local File

Upload your file to a web server and use --file-url:

# Instead of local file
python scripts/paddleocr-vl/vl_caller.py --file-path "large_file.pdf"  # ❌ May fail

# Use URL instead
python scripts/paddleocr-vl/vl_caller.py --file-url "https://your-server.com/large_file.pdf"  # ✅ No size limit

Benefits:
- No local file size limit
- Faster transfer (the API server fetches the URL directly)
- Suitable for very large files (>100MB)

Solution 2: Increase Size Limit

Adjust the limit in .env file:

# .env
VL_MAX_FILE_SIZE_MB=50  # Increase from 20MB to 50MB

Then process as usual:

python scripts/paddleocr-vl/vl_caller.py --file-path "large_file.pdf"

Note: Your API provider may still have upload limits. Check with your VL API service.

Solution 3: Compress/Optimize File

Use the built-in optimizer to reduce file size:

# Install optimization dependencies first
pip install -r scripts/paddleocr-vl/requirements-optimize.txt

# Optimize image (reduce quality)
python scripts/paddleocr-vl/optimize_file.py input.png output.png --quality 70

# Optimize PDF (compress images within)
python scripts/paddleocr-vl/optimize_file.py input.pdf output.pdf --target-size 15

# Then process optimized file
python scripts/paddleocr-vl/vl_caller.py --file-path "output.pdf" --pretty

The optimizer will:
- Compress images (adjust JPEG quality)
- Resize images if needed (maintain aspect ratio)
- Compress PDF images (reduce DPI to 150)
- Show before/after size comparison

Solution 4: Process Specific Pages (PDF Only)

If you only need certain pages from a large PDF, extract them first:

# Using PyMuPDF (requires: pip install PyMuPDF)
python -c "
import fitz
doc = fitz.open('large.pdf')
writer = fitz.open()
writer.insert_pdf(doc, from_page=0, to_page=4)  # Pages 1-5
writer.save('pages_1_5.pdf')
"

# Then process the smaller file
python scripts/paddleocr-vl/vl_caller.py --file-path "pages_1_5.pdf"

Solution 5: Upload to Cloud Storage

For extremely large files (>100MB), use cloud storage with public URLs:

# Example: Upload to AWS S3, Google Drive, or similar
# Get public URL: https://storage.example.com/my-document.pdf

# Process via URL
python scripts/paddleocr-vl/vl_caller.py --file-url "https://storage.example.com/my-document.pdf"

Comparison Table:

| Solution | Max Size | Speed | Complexity | Best For |
|----------|----------|-------|------------|----------|
| URL Upload | Unlimited | Fast | Low | Any large file |
| Increase Limit | Configurable | Medium | Very Low | Slightly over limit |
| Compress | ~70% reduction | Slow | Medium | Images/PDFs with images |
| Extract Pages | As needed | Fast | Medium | Multi-page PDFs |
| Cloud Storage | Unlimited | Fast | High | Very large files (>100MB) |
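
To decide which of these applies, a quick size check helps. The sketch below compares a local file against the limit; it assumes VL_MAX_FILE_SIZE_MB is visible as an environment variable (the skill documents it in .env) and falls back to the documented 20MB default.

```python
# Check a local file against the size limit before calling vl_caller.py.
import os

def exceeds_limit(path: str) -> bool:
    limit_mb = float(os.environ.get("VL_MAX_FILE_SIZE_MB", "20"))   # documented default: 20MB
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MB (limit {limit_mb:.0f} MB)")
    return size_mb > limit_mb

if exceeds_limit("large_file.pdf"):
    print("Use --file-url, raise VL_MAX_FILE_SIZE_MB, optimize the file, or extract pages (see above).")
```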

Error Handling

Authentication failed (401/403):

error: Authentication failed

→ Token is invalid, reconfigure with correct credentials

API quota exceeded (429):

error: API quota exceeded

→ Daily API quota exhausted, inform user to wait or upgrade

Unsupported format:

error: Unsupported file format

→ File format not supported, convert to PDF/PNG/JPG
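
A hedged sketch of how these cases can be routed once the caller reports a failure. The error strings and suggested follow-ups come from this section; the failure payload shape (ok: false with an error field) is an assumption, so verify it against what vl_caller.py actually returns.

```python
# Map known error messages to the guidance above; anything else: show it and stop.
GUIDANCE = {
    "Authentication failed": "Token is invalid - reconfigure with correct credentials.",
    "API quota exceeded": "Daily API quota exhausted - wait or upgrade.",
    "Unsupported file format": "Convert the file to PDF/PNG/JPG and retry.",
}

def handle_failure(response: dict) -> None:
    # Assumption: a failed call returns {"ok": false, "error": "..."}.
    error = response.get("error", "Unknown error")
    print(f"error: {error}")
    for known, advice in GUIDANCE.items():
        if known.lower() in error.lower():
            print(advice)
            return
    print("Show this error to the user and stop; do not fall back to other parsing methods.")
```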

Pseudo-Code: Content Extraction

Extract main content (most common):

def extract_main_content(json_response):
    regions = json_response['result']['layout']['regions']

    # Keep only core content types
    core_types = ['text', 'table', 'formula', 'figure', 'footnote']
    main_regions = [r for r in regions if r['type'] in core_types]

    # Sort by reading order (reading_order lists region ids in presentation order)
    reading_order = json_response['result']['layout']['reading_order']
    order = {region_id: i for i, region_id in enumerate(reading_order)}
    sorted_regions = sorted(main_regions, key=lambda r: order.get(r['id'], len(order)))

    # Present to user
    for region in sorted_regions:
        if region['type'] == 'text':
            print(region['content'])
        elif region['type'] == 'table':
            print_as_markdown_table(region['content'])
        elif region['type'] == 'formula':
            print(f"Formula: {region['latex']}")

Extract tables only:

def extract_tables(json_response):
    regions = json_response['result']['layout']['regions']
    tables = [r for r in regions if r['type'] == 'table']

    for i, table in enumerate(tables):
        print(f"Table {i+1}:")
        print_as_markdown_table(table['content'])

Extract everything:

def extract_complete(json_response):
    # Simply use the full_text which includes everything
    print(json_response['result']['full_text'])

    # Or present all regions in order
    regions = json_response['result']['layout']['regions']
    reading_order = json_response['result']['layout']['reading_order']

    for idx in reading_order:
        region = regions[idx]
        print(f"[{region['type']}] {region['content']}")
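
The pseudo-code above uses a print_as_markdown_table helper that is not defined in this skill. The sketch below is one possible implementation, assuming the table content shape shown in the JSON example (rows, cols, and a cells list of lists).

```python
# One possible print_as_markdown_table, assuming content = {"rows": ..., "cols": ..., "cells": [[...], ...]}.
def print_as_markdown_table(table_content: dict) -> None:
    cells = table_content.get("cells", [])
    if not cells:
        return
    header, *rows = cells
    print("| " + " | ".join(str(c) for c in header) + " |")
    print("|" + "|".join("---" for _ in header) + "|")
    for row in rows:
        print("| " + " | ".join(str(c) for c in row) + " |")
```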

Important Notes

  • The script NEVER filters content - It always returns complete data
  • Claude decides what to present - Based on user's specific request
  • All data is always available - Can be re-interpreted for different needs
  • No information is lost - Complete document structure preserved

Reference Documentation

For in-depth understanding of the PaddleOCR-VL system, refer to:
- references/paddleocr-vl/vl_model_spec.md - VL model specifications
- references/paddleocr-vl/layout_schema.md - Layout detection schema
- references/paddleocr-vl/output_format.md - Complete output format specification

Load these reference documents into context when:
- Debugging complex parsing issues
- Understanding layout detection algorithm
- Working with special document types
- Customizing content extraction logic

Testing the Skill

To verify the skill is working properly:

python scripts/paddleocr-vl/smoke_test.py

This tests configuration and optionally API connectivity.

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.