brightdata

by @Anshin-Health-Solutions in Web & API

# Install this skill:

npx skills add Anshin-Health-Solutions/superpai --skill "brightdata"

Install specific skill from multi-skill repository

# Description

Progressive URL scraping with Bright Data. Multiple tiers from free to premium.

# SKILL.md

name: brightdata
description: "Progressive URL scraping with Bright Data. Multiple tiers from free to premium."
triggers:
- Bright Data
- scrape URL
- web scraping
- data collection

Bright Data Skill

Progressive web scraping using a three-tier escalation methodology. Always start at the cheapest tier and escalate only when blocked. This skill covers direct fetching, proxy rotation, and full browser rendering.

Three-Tier Progressive Scraping Methodology

Tier 1: Direct Fetch (Free)

Tools: curl, WebFetch tool
Use when: Target site has no anti-bot protection, public content, no JavaScript rendering required.

# Simple direct fetch
curl -s -o output.html "https://example.com/page"

# Or use WebFetch tool with extraction prompt
# WebFetch url="https://example.com/page" prompt="Extract the main article text and metadata"

Escalate to Tier 2 when: You receive 403, 429, CAPTCHA challenges, or empty/bot-detection pages.
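
The escalation check above can be sketched as a small shell helper. This is only an illustration, not part of the skill itself: the function name `needs_escalation`, the 512-byte "empty body" threshold, and the list of marker strings are all assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether a Tier 1 response should be escalated.
# Usage: needs_escalation <http_status> <body_file>
# Returns 0 (true) when escalation looks warranted, 1 otherwise.
needs_escalation() {
  local status="$1" body_file="$2"

  # Blocked or rate-limited outright.
  case "$status" in
    403|429) return 0 ;;
  esac

  # Empty or near-empty body (the 512-byte threshold is an assumption).
  local size
  size=$(wc -c < "$body_file" | tr -d ' ')
  if [ "$size" -lt 512 ]; then
    return 0
  fi

  # CAPTCHA / bot-detection markers (illustrative list, not exhaustive).
  if grep -qiE 'captcha|access denied|verify you are human|unusual traffic' "$body_file"; then
    return 0
  fi

  return 1
}

# Example use with the Tier 1 curl call (left commented so the block has no network side effects):
#   status=$(curl -s -o output.html -w '%{http_code}' "https://example.com/page")
#   if needs_escalation "$status" output.html; then echo "escalate to Tier 2"; fi
```

Real bot-detection pages vary widely, so the marker list would likely need tuning per target site.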

Tier 2: Proxy Rotation (Standard)

Tools: Bright Data residential/datacenter proxies
Cost: ~$0.10-0.60 per GB depending on proxy zone type

# Datacenter proxy (cheapest, lowest success rate)
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-datacenter:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://target-site.com/page"

# Residential proxy (more expensive, higher success rate)
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-residential:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://target-site.com/page"

# With country targeting
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-residential-country-us:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://target-site.com/page"

Escalate to Tier 3 when: Proxy requests still return bot detection, site requires JavaScript rendering, or content loads dynamically.

Tier 3: Browser Rendering (Premium)

Tools: Bright Data Scraping Browser or Web Unlocker
Cost: ~$1.00-3.00 per 1K requests

# Web Unlocker API (handles CAPTCHAs, fingerprinting, rendering)
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-unlocker:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://heavily-protected-site.com/page"

# SERP API (specialized for search engines)
curl "https://api.brightdata.com/serp/req?customer={CUSTOMER_ID}&zone=serp" \
  -H "Authorization: Bearer ${BRIGHTDATA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"query": "search terms", "search_engine": "google", "country": "us"}'

Cost Comparison Table

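| Tier | Method | Cost per 1K Requests | Success Rate | Speed |
|------|--------|----------------------|--------------|-------|
| 1 | Direct curl/WebFetch | Free | 40-60% | Fastest |
| 2a | Datacenter proxy | ~$0.10 | 60-75% | Fast |
| 2b | Residential proxy | ~$0.50 | 80-90% | Medium |
| 3a | Web Unlocker | ~$2.00 | 95-99% | Slower |
| 3b | SERP API | ~$3.00 | 99%+ | Slowest |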
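
To make the comparison concrete, the per-1K figures in the table can be turned into a rough budget estimate. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the prices are the approximate values from the table, not live Bright Data pricing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate spend (in USD) for N requests at a given tier,
# using the approximate per-1K prices from the cost comparison table.
# Usage: estimate_cost <tier> <request_count>
estimate_cost() {
  local tier="$1" requests="$2" per_1k
  case "$tier" in
    1)  per_1k=0 ;;
    2a) per_1k=0.10 ;;
    2b) per_1k=0.50 ;;
    3a) per_1k=2.00 ;;
    3b) per_1k=3.00 ;;
    *)  echo "unknown tier: $tier" >&2; return 1 ;;
  esac
  # cost = (requests / 1000) * price per 1K; LC_ALL=C keeps "." as the decimal point
  LC_ALL=C awk -v n="$requests" -v p="$per_1k" 'BEGIN { printf "%.2f\n", n / 1000 * p }'
}
```

For example, `estimate_cost 2b 50000` prints `25.00` and `estimate_cost 3a 50000` prints `100.00`, which shows why escalating only the URLs that actually fail is worthwhile.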

Proxy Zone Configuration

Zones are configured in the Bright Data dashboard (brightdata.com/cp). Each zone has:
- Zone Name: Identifier used in proxy URL (e.g., datacenter, residential, unlocker)
- Proxy Type: Datacenter, ISP, Residential, or Mobile
- Country Targeting: Append -country-{code} to zone name
- Session Management: Add -session-{id} for sticky sessions (same IP across requests)
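
The naming rules above compose into a single proxy username. A minimal sketch of that composition follows, assuming the `brd-customer-...-zone-...` pattern used in the curl examples in this document; `build_proxy_user` is a hypothetical helper name and the argument values in the usage note are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble a proxy username from customer ID, zone name,
# and the optional country and session suffixes described above.
# Usage: build_proxy_user <customer_id> <zone> [country_code] [session_id]
build_proxy_user() {
  local customer="$1" zone="$2" country="${3:-}" session="${4:-}"
  local user="brd-customer-${customer}-zone-${zone}"

  # Country targeting: append -country-{code}
  if [ -n "$country" ]; then
    user="${user}-country-${country}"
  fi

  # Sticky session: append -session-{id} to keep the same IP across requests
  if [ -n "$session" ]; then
    user="${user}-session-${session}"
  fi

  printf '%s\n' "$user"
}
```

Calling `build_proxy_user c_123 residential us abc` prints `brd-customer-c_123-zone-residential-country-us-session-abc`; passing an empty string for the country skips that suffix.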

SERP API Usage

For search engine results specifically, use the SERP API instead of general scraping:

curl "https://api.brightdata.com/serp/req" \
  -H "Authorization: Bearer ${BRIGHTDATA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best coffee shops in austin",
    "search_engine": "google",
    "country": "us",
    "num_results": 20,
    "parse": true
  }'

The parse: true flag returns structured JSON with title, URL, snippet, and position for each result.

Detailed Process

1. Assess the Target -- Check the URL. Is it a public page? Does it require JS? Is it a search engine?
2. Start at Tier 1 -- Try direct fetch with curl or WebFetch. Inspect the response for real content.
3. Evaluate Response -- Check for: 403/429 status, CAPTCHA HTML, empty body, bot detection messages.
4. Escalate if Needed -- Move to Tier 2 (proxy) or Tier 3 (browser/unlocker) based on failure type.
5. Extract Content -- Parse the successful HTML response for the target data.
6. Return Structured Output -- Format extracted data as JSON matching the parser skill schema.
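
The fetch, evaluate, and escalate steps can be sketched as a simple driver loop. This is a simplified illustration under stated assumptions: `scrape_with_escalation` is a hypothetical name, the fetcher is a caller-supplied function (not something the skill defines), and the loop walks the tiers strictly in cost order rather than skipping ahead based on failure type.

```shell
#!/usr/bin/env bash
# Hypothetical driver: try each tier in cost order and stop at the first one
# whose fetcher reports success.
# Usage: scrape_with_escalation <url> <fetcher>
#   <fetcher> is the name of a caller-supplied function invoked as
#   `<fetcher> <tier> <url>`; it should return 0 only when real content was fetched.
scrape_with_escalation() {
  local url="$1" fetcher="$2" tier
  for tier in 1 2a 2b 3a; do
    if "$fetcher" "$tier" "$url"; then
      printf '%s\n' "$tier"   # report which tier succeeded
      return 0
    fi
  done
  echo "all tiers failed for $url" >&2
  return 1
}
```

A fetcher would typically wrap the curl command for the given tier and check the response before returning. Keeping it as a parameter lets the loop be exercised with a stub before any paid requests are made.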

When to Escalate Between Tiers

| Signal | Current Tier | Action |
|--------|--------------|--------|
| HTTP 200 with real content | Any | Success -- do not escalate |
| HTTP 403 or 429 | Tier 1 | Escalate to Tier 2 (datacenter proxy) |
| Bot detection page | Tier 2a | Escalate to Tier 2b (residential proxy) |
| CAPTCHA challenge | Tier 2b | Escalate to Tier 3 (Web Unlocker) |
| JavaScript-rendered content | Tier 1 or 2 | Escalate to Tier 3 (browser rendering) |
| Search engine results | Any | Use SERP API directly |
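
The decision table can be expressed as a single lookup. In the sketch below, the function name `next_tier` and the signal labels (`ok`, `blocked`, `bot`, `captcha`, `js`, `search`) are hypothetical shorthand for the rows above. Combinations the table does not list fall back to "step up one tier", which is an assumption made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: given a failure signal and the current tier,
# print the tier to try next ("done" on success, "fail" when nothing is left).
# Usage: next_tier <signal> <current_tier>
#   signal: ok | blocked | bot | captcha | js | search
#   tier:   1 | 2a | 2b | 3a | 3b
next_tier() {
  local signal="$1" current="$2"
  case "$signal:$current" in
    ok:*)             echo "done" ;;  # HTTP 200 with real content
    search:*)         echo "3b" ;;    # search engine results: SERP API directly
    js:1|js:2a|js:2b) echo "3a" ;;    # JavaScript rendering: jump to Tier 3
    *:1)              echo "2a" ;;    # e.g. 403/429 at Tier 1
    *:2a)             echo "2b" ;;    # e.g. bot detection page at Tier 2a
    *:2b)             echo "3a" ;;    # e.g. CAPTCHA at Tier 2b
    *)                echo "fail" ;;  # already at Tier 3: nothing left to try
  esac
}
```

For instance, `next_tier blocked 1` prints `2a`, `next_tier js 1` prints `3a`, and `next_tier captcha 3a` prints `fail`.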

When to Use

- User provides a URL and asks to "scrape it", "get the content", "extract data from this site"
- WebFetch returns blocked/empty content and escalation is needed
- Bulk URL scraping where some sites have anti-bot protection
- Search engine result collection (use SERP API path directly)
- Price monitoring, competitive analysis, or market research data collection

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf
