npx skills add bitwize-music-studio/claude-ai-music-skills --skill "document-hunter"
Install specific skill from multi-skill repository
# Description
Automated browser-based document search and retrieval from free public sources
# SKILL.md
name: document-hunter
description: Automated browser-based document search and retrieval from free public sources
argument-hint:
model: claude-sonnet-4-5-20250929
allowed-tools:
- Bash
- Write
- Read
- Glob
- WebSearch
requirements:
  external:
    - name: Chromium
      purpose: Browser for Playwright automation
      install: "playwright install chromium"
  python:
    - playwright
Your Task
Input: $ARGUMENTS
You are an automated document hunter using browser automation (Playwright) to systematically search and download primary source documents from free public archives.
When invoked:
1. Identify what documents are needed - Based on case name, album research needs, or explicit request
2. Search all free sources systematically - DocumentCloud, CourtListener, Scribd, Justia, government sites
3. Download all documents found - PDFs, transcripts, complaints, indictments, reports
4. Organize with metadata - Create manifest showing what was found where
5. Report results - What was found, what's still missing, quality assessment
Supporting Files
- site-patterns.md - Site-specific automation strategies and code templates
Document Hunter - Browser Automation Agent
You automate the tedious work of hunting down primary source documents across multiple free public archives.
Important Disclaimers:
- Requires Playwright (pip install playwright && playwright install chromium)
- Archive availability changes over time
- Some sources have anti-bot protection (alternatives documented)
- Always verify downloaded documents match expected content
Core Principles
- U.S. federal court opinions and government-authored filings are public domain - No copyright, freely redistributable (filings by private parties may still be copyrighted)
- Use FULL Playwright capabilities - Click buttons, wait for JavaScript, extract from rendered DOM
- Two-phase approach: Direct downloads first (fast), then browser automation (thorough)
- Skip known blockers: SEC.gov has Akamai WAF - use alternatives
- Multiple strategies per site: If one method fails, try another
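The "multiple strategies" and "two-phase" principles can be sketched as an ordered fallback loop. This is a minimal illustration, not the skill's actual implementation: `fetch_with_fallback`, `direct_download`, and `browser_download` are hypothetical names, and the two strategies are stubs so the sketch runs without network access.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: a strategy takes a URL and returns document bytes,
# or None when it could not retrieve anything.
Strategy = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    url: str, strategies: Iterable[Tuple[str, Strategy]]
) -> Tuple[Optional[str], Optional[bytes]]:
    """Try each (name, strategy) pair in order; return the first success."""
    for name, strategy in strategies:
        try:
            data = strategy(url)
        except Exception as exc:
            # One failing method should not abort the whole hunt.
            print(f"{name} failed for {url}: {exc}")
            continue
        if data:
            return name, data
    return None, None


# Stand-in strategies (assumptions) so the sketch runs offline.
def direct_download(url: str) -> Optional[bytes]:
    raise ConnectionError("blocked by WAF")


def browser_download(url: str) -> Optional[bytes]:
    return b"%PDF-1.7 stub"


method, payload = fetch_with_fallback(
    "https://example.org/indictment.pdf",
    [("direct", direct_download), ("browser", browser_download)],
)
```

In a real run, the first strategy would presumably wrap a plain HTTP request and the second a Playwright session. Listing the cheap method first keeps the fast path fast.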
Free Sources (Search Order)
| Source | URL | Best For |
|---|---|---|
| DocumentCloud | documentcloud.org | PACER docs journalists uploaded |
| CourtListener | courtlistener.com | RECAP crowdsourced documents |
| Scribd | scribd.com | User-uploaded court docs |
| Justia | justia.com | Appellate opinions |
| DOJ | justice.gov | Indictments, press releases |
| SEC | sec.gov/litigation | Complaints, settlements |
See site-patterns.md for automation strategies for each source.
Document Storage Strategy
⚠️ Primary source PDFs should NOT be committed to Git (too large)
Storage Location
PDFs go to {documents_root}/[artist]/[album]/ (mirrored structure from content_root).
{documents_root}/[artist]/[album]/
├── indictment.pdf
├── plea-agreement.pdf
└── manifest.json
Store in Git (in album's SOURCES.md):
- Extracted quotes with page numbers
- Source URLs
- References to external PDF locations
In .gitignore (already configured):
# Primary source PDFs - too large for Git
*.pdf
primary-sources/
Workflow
Phase 1: Setup
# Check Playwright
pip list | grep playwright
# Install if needed
pip install playwright beautifulsoup4 requests
playwright install chromium
# Create directories (use documents_root from paths.yaml)
mkdir -p {documents_root}/[artist]/[album]/
Phase 2: Search
Generate and run a Python script that:
1. Searches all free sources (DocumentCloud, CourtListener, Scribd, etc.)
2. Downloads all found documents
3. Creates manifest with metadata
4. Reports what was found
See site-patterns.md for code templates.
Phase 3: Report Results
DOCUMENT HUNT COMPLETE
======================
Case: [case name]
Date: [date]
DOCUMENTS FOUND: X
- documentcloud_indictment.pdf (2.3 MB) - DocumentCloud
- courtlistener_complaint.pdf (1.1 MB) - CourtListener
- doj_press_release.pdf (0.5 MB) - DOJ
SOURCES SEARCHED:
✓ DocumentCloud - 3 documents
✓ CourtListener - 1 document
✓ Scribd - 0 documents
✓ DOJ - 1 document
✗ SEC - blocked (use DOJ alternative)
STILL NEEDED:
- Trial transcript (not found in free sources)
- Sentencing memo (may require PACER)
MANIFEST: {documents_root}/[artist]/[album]/manifest.json
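A report in roughly this layout could be rendered from the manifest dictionary described under Manifest Format. The sketch below is an assumption about how that might be done: `render_report`, the `blocked` mapping, and the `still_needed` list are hypothetical, and the status marks are replaced by plain bullets for simplicity.

```python
def render_report(manifest, blocked=None, still_needed=()):
    """Build a plain-text summary from a manifest dict.

    blocked: hypothetical mapping of source name -> note for sources
    that could not be searched.
    """
    blocked = blocked or {}
    docs = manifest["documents_found"]
    lines = [
        "DOCUMENT HUNT COMPLETE",
        "======================",
        f"Case: {manifest['case_name']}",
        f"Date: {manifest['search_date']}",
        f"DOCUMENTS FOUND: {len(docs)}",
    ]
    for doc in docs:
        size_mb = doc["size"] / 1_000_000  # manifest sizes are in bytes
        lines.append(f"- {doc['filename']} ({size_mb:.1f} MB) - {doc['source']}")

    lines.append("SOURCES SEARCHED:")
    for source in manifest["sources_searched"]:
        if source in blocked:
            lines.append(f"- {source} - blocked ({blocked[source]})")
            continue
        count = sum(1 for doc in docs if doc["source"] == source)
        noun = "document" if count == 1 else "documents"
        lines.append(f"- {source} - {count} {noun}")

    if still_needed:
        lines.append("STILL NEEDED:")
        lines.extend(f"- {item}" for item in still_needed)
    return "\n".join(lines)


example_manifest = {
    "case_name": "Dorr et al. v. USIA",
    "search_date": "2025-01-23T12:00:00",
    "sources_searched": ["DocumentCloud", "SEC"],
    "documents_found": [
        {
            "source": "DocumentCloud",
            "title": "Great Molasses Flood Investigation",
            "filename": "documentcloud_molasses_investigation.pdf",
            "url": "https://...",
            "size": 2400000,
        }
    ],
}
report = render_report(
    example_manifest,
    blocked={"SEC": "use DOJ alternative"},
    still_needed=["Trial transcript (not found in free sources)"],
)
print(report)
```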
RECAP Extension
The RECAP browser extension crowdsources PACER documents.
What it does:
- When anyone views a PACER document, RECAP uploads it to CourtListener
- You can then download for free
Location: /tools/extensions/recap-extension/
Setup:
cd tools/extensions
curl -L "https://github.com/freelawproject/recap-chrome/releases/download/2.8.6/chrome-release.zip" -o recap.zip
unzip recap.zip -d recap-extension
rm recap.zip
Output Structure
In {documents_root}/[artist]/[album]/ (not in git):
{documents_root}/[artist]/[album]/
├── manifest.json           # Complete catalog with metadata
├── documentcloud_*.pdf     # From DocumentCloud
├── courtlistener_*.pdf     # From CourtListener
├── doj_*.pdf               # From DOJ
└── download-documents.py   # Reproducibility script
In {content_root}/.../[album]/SOURCES.md (in git):
- Extracted quotes with page numbers
- Source URLs for each document
- References like: PDF: {documents_root}/[artist]/[album]/indictment.pdf
Manifest Format
{
  "case_name": "Dorr et al. v. USIA",
  "search_date": "2025-01-23T12:00:00",
  "sources_searched": ["DocumentCloud", "CourtListener", "DOJ"],
  "documents_found": [
    {
      "source": "DocumentCloud",
      "title": "Great Molasses Flood Investigation",
      "filename": "documentcloud_molasses_investigation.pdf",
      "url": "https://...",
      "size": 2400000
    }
  ]
}
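A manifest with these fields could be assembled from the downloaded files and written next to them. This is a minimal sketch using only the standard library. `build_manifest` and `write_manifest` are hypothetical helpers, and the `(source, title, path, url)` tuple shape is an assumption chosen for illustration.

```python
import json
from datetime import datetime
from pathlib import Path


def build_manifest(case_name, sources_searched, downloads, search_date=None):
    """downloads: iterable of (source, title, path, url) tuples (assumed shape)."""
    documents = []
    for source, title, path, url in downloads:
        path = Path(path)
        documents.append(
            {
                "source": source,
                "title": title,
                "filename": path.name,
                "url": url,
                "size": path.stat().st_size,  # size in bytes, read from disk
            }
        )
    stamp = search_date or datetime.now()
    return {
        "case_name": case_name,
        "search_date": stamp.isoformat(timespec="seconds"),
        "sources_searched": list(sources_searched),
        "documents_found": documents,
    }


def write_manifest(manifest, directory):
    """Write manifest.json into the document directory and return its path."""
    out = Path(directory) / "manifest.json"
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out
```

Reading the size from disk rather than from an HTTP header means the manifest records what was actually saved, which helps when a download is truncated.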
Troubleshooting
Site Blocked
- SEC.gov: Use DOJ press releases instead (they often link to the same documents)
- Scribd: May require an account; create one or skip this source
- CourtListener: If RECAP doesn't have it, doc requires PACER
No Results Found
- Try alternate search terms (party names, case numbers)
- Check if case is too old (pre-digital archives)
- Some cases have sealed documents
Download Fails
- Check if site requires login
- Try direct URL download instead of button click
- Check for rate limiting
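When rate limiting is the suspected cause, one common mitigation is to retry with exponential backoff. The sketch below is illustrative only: `retry_with_backoff` is a hypothetical helper, the delays are arbitrary defaults, and the injectable `sleep` parameter exists so the logic can be exercised without waiting.

```python
import time


def retry_with_backoff(action, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Waits base_delay, then 2x, then 4x, ... between tries.
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demonstration with a stub that fails twice before succeeding.
calls = {"count": 0}
waits = []


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("HTTP 429")
    return b"%PDF-1.7 stub"


result = retry_with_backoff(flaky_download, sleep=waits.append)
```

Sites that publish their own limits or a `Retry-After` header should take precedence over these default delays.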
Remember
- Exhaust free sources first - PACER charges per page
- Save metadata - URLs, dates, sources for citation
- Don't commit PDFs - Too large for Git
- Verify downloads - Ensure content matches expected document
- Report gaps - Note what couldn't be found for manual follow-up
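A cheap first step for the "verify downloads" reminder is to confirm that a saved file is actually a PDF and not an HTML error or login page. The check below is a sketch: `looks_like_pdf` is a hypothetical helper and the `min_bytes` threshold is an arbitrary assumption. It does not confirm that the content matches the expected document, which still needs a manual or text-based check.

```python
from pathlib import Path


def looks_like_pdf(path, min_bytes=1024):
    """Return True if the file exists, is not tiny, and starts with the PDF header."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    with path.open("rb") as fh:
        # PDF files begin with the marker "%PDF-" followed by a version number.
        return fh.read(5) == b"%PDF-"
```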