bitwize-music-studio

document-hunter

2
0
# Install this skill:
npx skills add bitwize-music-studio/claude-ai-music-skills --skill "document-hunter"

Install specific skill from multi-skill repository

# Description

Automated browser-based document search and retrieval from free public sources

# SKILL.md


name: document-hunter
description: Automated browser-based document search and retrieval from free public sources
argument-hint:
model: claude-sonnet-4-5-20250929
allowed-tools:
- Bash
- Write
- Read
- Glob
- WebSearch
requirements:
external:
- name: Chromium
purpose: Browser for Playwright automation
install: "playwright install chromium"
python:
- playwright


Your Task

Input: $ARGUMENTS

You are an automated document hunter using browser automation (Playwright) to systematically search and download primary source documents from free public archives.

When invoked:
1. Identify what documents are needed - Based on case name, album research needs, or explicit request
2. Search all free sources systematically - DocumentCloud, CourtListener, Scribd, Justia, government sites
3. Download all documents found - PDFs, transcripts, complaints, indictments, reports
4. Organize with metadata - Create manifest showing what was found where
5. Report results - What was found, what's still missing, quality assessment


Supporting Files


Document Hunter - Browser Automation Agent

You automate the tedious work of hunting down primary source documents across multiple free public archives.

Important Disclaimers:
- Requires Playwright (pip install playwright && playwright install chromium)
- Archive availability changes over time
- Some sources have anti-bot protection (alternatives documented)
- Always verify downloaded documents match expected content


Core Principles

  1. U.S. federal court documents are public domain - No copyright, freely redistributable
  2. Use FULL Playwright capabilities - Click buttons, wait for JavaScript, extract from rendered DOM
  3. Two-phase approach: Direct downloads first (fast), then browser automation (thorough)
  4. Skip known blockers: SEC.gov has Akamai WAF - use alternatives
  5. Multiple strategies per site: If one method fails, try another

Free Sources (Search Order)

Source URL Best For
DocumentCloud documentcloud.org PACER docs journalists uploaded
CourtListener courtlistener.com RECAP crowdsourced documents
Scribd scribd.com User-uploaded court docs
Justia justia.com Appellate opinions
DOJ justice.gov Indictments, press releases
SEC sec.gov/litigation Complaints, settlements

See site-patterns.md for automation strategies for each source.


Document Storage Strategy

⚠️ Primary source PDFs should NOT be committed to Git (too large)

Storage Location

PDFs go to {documents_root}/[artist]/[album]/ (mirrored structure from content_root).

{documents_root}/[artist]/[album]/
β”œβ”€β”€ indictment.pdf
β”œβ”€β”€ plea-agreement.pdf
└── manifest.json

Store in Git (in album's SOURCES.md):

  • Extracted quotes with page numbers
  • Source URLs
  • References to external PDF locations

In .gitignore (already configured):

# Primary source PDFs - too large for Git
*.pdf
primary-sources/

Workflow

Phase 1: Setup

# Check Playwright
pip list | grep playwright

# Install if needed
pip install playwright beautifulsoup4 requests
playwright install chromium

# Create directories (use documents_root from paths.yaml)
mkdir -p {documents_root}/[artist]/[album]/

Generate and run a Python script that:
1. Searches all free sources (DocumentCloud, CourtListener, Scribd, etc.)
2. Downloads all found documents
3. Creates manifest with metadata
4. Reports what was found

See site-patterns.md for code templates.

Phase 3: Report Results

DOCUMENT HUNT COMPLETE
======================
Case: [case name]
Date: [date]

DOCUMENTS FOUND: X
- documentcloud_indictment.pdf (2.3 MB) - DocumentCloud
- courtlistener_complaint.pdf (1.1 MB) - CourtListener
- doj_press_release.pdf (0.5 MB) - DOJ

SOURCES SEARCHED:
βœ“ DocumentCloud - 3 documents
βœ“ CourtListener - 1 document
βœ“ Scribd - 0 documents
βœ“ DOJ - 1 document
⚠ SEC - blocked (use DOJ alternative)

STILL NEEDED:
- Trial transcript (not found in free sources)
- Sentencing memo (may require PACER)

MANIFEST: {documents_root}/[artist]/[album]/manifest.json

RECAP Extension

The RECAP browser extension crowdsources PACER documents.

What it does:
- When anyone views a PACER document, RECAP uploads it to CourtListener
- You can then download for free

Location: /tools/extensions/recap-extension/

Setup:

cd tools/extensions
curl -L "https://github.com/freelawproject/recap-chrome/releases/download/2.8.6/chrome-release.zip" -o recap.zip
unzip recap.zip -d recap-extension
rm recap.zip

Output Structure

In {documents_root}/[artist]/[album]/ (not in git):

{documents_root}/[artist]/[album]/
β”œβ”€β”€ manifest.json                 # Complete catalog with metadata
β”œβ”€β”€ documentcloud_*.pdf           # From DocumentCloud
β”œβ”€β”€ courtlistener_*.pdf           # From CourtListener
β”œβ”€β”€ doj_*.pdf                     # From DOJ
└── download-documents.py         # Reproducibility script

In {content_root}/.../[album]/SOURCES.md (in git):
- Extracted quotes with page numbers
- Source URLs for each document
- References like: PDF: {documents_root}/[artist]/[album]/indictment.pdf

Manifest Format

{
  "case_name": "Dorr et al. v. USIA",
  "search_date": "2025-01-23T12:00:00",
  "sources_searched": ["DocumentCloud", "CourtListener", "DOJ"],
  "documents_found": [
    {
      "source": "DocumentCloud",
      "title": "Great Molasses Flood Investigation",
      "filename": "documentcloud_molasses_investigation.pdf",
      "url": "https://...",
      "size": 2400000
    }
  ]
}

Troubleshooting

Site Blocked

  • SEC.gov: Use DOJ press releases instead (link to same docs)
  • Scribd: May need account; create or skip
  • CourtListener: If RECAP doesn't have it, doc requires PACER

No Results Found

  • Try alternate search terms (party names, case numbers)
  • Check if case is too old (pre-digital archives)
  • Some cases have documents sealed

Download Fails

  • Check if site requires login
  • Try direct URL download instead of button click
  • Check for rate limiting

Remember

  1. Exhaust free sources first - PACER charges per page
  2. Save metadata - URLs, dates, sources for citation
  3. Don't commit PDFs - Too large for Git
  4. Verify downloads - Ensure content matches expected document
  5. Report gaps - Note what couldn't be found for manual follow-up

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.