Install a specific skill from the multi-skill repository:

```bash
npx skills add grahama1970/agent-skills --skill "fetcher"
```
# Description
# SKILL.md

```yaml
---
name: fetcher
description: >
  Fetch web pages, PDFs, and documents with automatic fallbacks and content extraction.
  Use when user says "fetch this URL", "download this page", "crawl this website",
  "extract content from", "get the PDF", or provides URLs needing retrieval.
allowed-tools: Bash, Read
triggers:
  - fetch this URL
  - download page
  - crawl website
  - extract content from
  - get the PDF
  - scrape this site
  - retrieve document
metadata:
  short-description: Web crawling and document fetching CLI
---
```
# Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

- Self-contained skill: auto-installs via uvx from git (no pre-installation needed).
- Fully automatic: Playwright browsers are installed on first run for SPA/JS page support.
## Simplest Usage

```bash
# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly, if fetcher is installed
fetcher get https://example.com
```
## Common Commands

```bash
./run.sh get https://example.com    # Fetch a single URL
./run.sh get-manifest urls.txt      # Fetch a list of URLs
./run.sh get-manifest - < urls.txt  # Fetch from stdin
```
## Common Patterns

### Fetch a single URL

```bash
fetcher get https://www.nasa.gov --out run/nasa
```

Outputs to `run/nasa/`:

- `consumer_summary.json` - structured result
- `Walkthrough.md` - human-readable summary
- `downloads/` - raw content files
### Fetch multiple URLs

```bash
# From a file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
echo -e "https://example.com\nhttps://nasa.gov" | fetcher get-manifest -
```
### ETL mode (full control)

```bash
fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo
```
### Check environment

```bash
fetcher doctor               # Check dependencies and config
fetcher get --dry-run <url>  # Validate without fetching
fetcher-etl --help-full      # All options
fetcher-etl --find metrics   # Search options
```
## Output Structure

```
run/artifacts/<run-id>/
├── results.jsonl          # Fetch results per URL
├── consumer_summary.json  # Summary stats
├── Walkthrough.md         # Human-readable summary
├── downloads/             # Raw files (HTML, PDF, etc.)
├── text_blobs/            # Extracted text
├── markdown/              # LLM-friendly markdown
├── fit_markdown/          # Pruned markdown for LLM input
├── junk_results.jsonl     # Failed/junk URLs
└── junk_table.md          # Quick triage table
```
## Content Extraction

### Enable markdown output

```bash
export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com
```
### Rolling windows (for chunking)

```bash
export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
```
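For intuition, here is a minimal standalone sketch of what overlapping windows with these settings produce. It mirrors the general rolling-window technique, not fetcher's internal implementation:

```python
# A standalone sketch of rolling windows (size 6000, step 3000) over raw text.
# Illustrative only: this is not fetcher's internal code.
def rolling_windows(text: str, size: int = 6000, step: int = 3000) -> list[str]:
    """Slice text into windows of `size` chars, advancing by `step` chars."""
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + size])
        if start + size >= len(text):  # final window already reaches the end
            break
        start += step
    return windows

# With step = size / 2, consecutive windows overlap by 50%, so content near a
# window boundary always appears whole in at least one neighboring window.
print([len(w) for w in rolling_windows("x" * 15000)])  # [6000, 6000, 6000, 6000]
```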
## Advanced Features

### HTTP caching

```bash
# Cache is enabled by default
fetcher get https://example.com

# Disable the cache for a fresh fetch
fetcher get https://example.com --no-http-cache
```
### PDF discovery

```bash
# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com
```
### Proxy rotation (rate-limited sites)

```bash
export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl
```
### Brave/Wayback fallbacks

```bash
# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl
```
## Python API

```python
import asyncio
from pathlib import Path

from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results

async def main() -> None:
    # Limit to 4 concurrent fetches, at most 2 per domain
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)

    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)

    write_results(results, Path("artifacts/nasa.jsonl"))
    print(audit)

asyncio.run(main())
```
### Single URL helper

```python
import asyncio
from fetcher.workflows.fetcher import fetch_url

async def main() -> None:
    result = await fetch_url("https://example.com")
    print(result.content_verdict)  # "ok", "empty", "paywall", etc.
    print(result.text)             # Extracted text

asyncio.run(main())
```
## FetchResult Fields

| Field | Description |
|---|---|
| `url` | Original URL |
| `final_url` | URL after redirects |
| `content_verdict` | `ok`, `empty`, `paywall`, `error`, etc. |
| `text` | Extracted text content |
| `file_path` | Path to the raw download |
| `markdown_path` | Path to markdown (if enabled) |
| `from_cache` | Whether the result came from cache |
| `content_sha256` | Content hash for change detection |
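These fields also appear per URL in a run's `results.jsonl`. As a rough sketch of triaging a run, assuming each line is a JSON object carrying the fields above (the exact record layout is an assumption, not a documented schema):

```python
import json
from pathlib import Path

# Group a run's URLs by content_verdict for quick triage.
# Assumes one JSON object per line with the fields listed above.
results_path = Path("run/artifacts/<run-id>/results.jsonl")  # substitute your run id

by_verdict: dict[str, list[str]] = {}
for line in results_path.read_text().splitlines():
    record = json.loads(line)
    by_verdict.setdefault(record.get("content_verdict", "unknown"), []).append(record.get("url", "?"))

for verdict, urls in sorted(by_verdict.items()):
    print(f"{verdict}: {len(urls)} URL(s)")
```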
## Environment Variables

| Variable | Purpose |
|---|---|
| `BRAVE_API_KEY` | Enable Brave search fallbacks |
| `FETCHER_EMIT_MARKDOWN` | Generate LLM-friendly markdown |
| `FETCHER_EMIT_FIT_MARKDOWN` | Generate pruned markdown |
| `FETCHER_DOWNLOAD_MODE` | `text`, `download_only`, or `rolling_extract` |
| `FETCHER_HTTP_CACHE_DISABLE` | Disable HTTP caching |
| `FETCHER_ENABLE_PDF_DISCOVERY` | Auto-fetch embedded PDFs |
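If you drive the CLI from Python, one way to apply these toggles per invocation is through the subprocess environment. A sketch: the variable and command names come from this document, while the subprocess wiring is just one convenient pattern:

```python
import os
import subprocess

# Run a fetch with markdown emission enabled, without mutating the parent env.
env = os.environ | {
    "FETCHER_EMIT_MARKDOWN": "1",
    "FETCHER_EMIT_FIT_MARKDOWN": "1",
}
subprocess.run(
    ["fetcher", "get", "https://example.com", "--out", "run/example"],
    env=env,
    check=True,  # raise if the fetch exits non-zero
)
```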
## Troubleshooting

| Problem | Solution |
|---|---|
| Playwright missing | `uvx --from "git+https://github.com/grahama1970/fetcher.git" playwright install chromium` |
| SPA page returns empty/thin | Playwright auto-fallback should trigger; check `used_playwright` in the summary |
| Stale cached results | Set `FETCHER_HTTP_CACHE_DISABLE=1` for a fresh fetch |
| Rate limited | Configure proxy rotation or reduce concurrency |
| Paywall detected | Check `content_verdict` and use alternates |
| Empty content | Check `junk_results.jsonl` for a diagnosis |

Run `fetcher doctor` to check the environment and dependencies.
## SPA/JavaScript Page Support

Fetcher automatically falls back to Playwright for known SPA domains. If a page returns thin/empty content:

- Check if `used_playwright: 1` appears in `consumer_summary.json`.
- If not, the domain may need to be added to `SPA_FALLBACK_DOMAINS` in the fetcher source.
- Force a fresh fetch with `FETCHER_HTTP_CACHE_DISABLE=1`.
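A small sketch of that first check, assuming `consumer_summary.json` is a flat JSON object containing the `used_playwright` key (the key name comes from this document; the surrounding structure is an assumption):

```python
import json
from pathlib import Path

# Did the Playwright fallback fire for this run?
summary = json.loads(Path("run/artifacts/<run-id>/consumer_summary.json").read_text())
if summary.get("used_playwright"):
    print("Playwright fallback fired for this run.")
else:
    print("HTTP-only fetch; try FETCHER_HTTP_CACHE_DISABLE=1 and re-run.")
```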
# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.