# Install this skill:
npx skills add Anshin-Health-Solutions/superpai --skill "brightdata"

Install specific skill from multi-skill repository

# Description

Progressive URL scraping with Bright Data. Multiple tiers from free to premium.

# SKILL.md


---
name: brightdata
description: "Progressive URL scraping with Bright Data. Multiple tiers from free to premium."
triggers:
- Bright Data
- scrape URL
- web scraping
- data collection
---


Bright Data Skill

Progressive web scraping using a three-tier escalation methodology. Always start at the cheapest tier and escalate only when blocked. This skill covers direct fetching, proxy rotation, and full browser rendering.

Three-Tier Progressive Scraping Methodology

Tier 1: Direct Fetch (Free)

Tools: curl, WebFetch tool
Use when: The target site has no anti-bot protection, the content is public, and no JavaScript rendering is required.

# Simple direct fetch
curl -s -o output.html "https://example.com/page"

# Or use WebFetch tool with extraction prompt
# WebFetch url="https://example.com/page" prompt="Extract the main article text and metadata"

Escalate to Tier 2 when: You receive an HTTP 403 or 429 response, a CAPTCHA challenge, or an empty or bot-detection page.
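The escalation signals above can be checked mechanically. The following is a minimal sketch, assuming the HTTP status code and the saved body are already available; `classify_response` is a hypothetical helper, and the text patterns it greps for are illustrative guesses rather than a complete list of bot-detection markers.

```shell
# classify_response STATUS BODY_FILE
# Prints one of: ok, blocked, captcha, empty.
# The grep patterns are assumptions for illustration; tune them per target site.
classify_response() {
  if [ "$1" = "403" ] || [ "$1" = "429" ]; then
    # Status codes the skill treats as a block.
    echo "blocked"
  elif [ ! -s "$2" ]; then
    # Missing or zero-length body.
    echo "empty"
  elif grep -qiE 'captcha|verify you are human' "$2"; then
    echo "captcha"
  elif grep -qiE 'access denied|unusual traffic' "$2"; then
    echo "blocked"
  else
    echo "ok"
  fi
}

# Usage (performs a real request, so it is left commented out):
# status=$(curl -s -o output.html -w '%{http_code}' "https://example.com/page")
# classify_response "$status" output.html
```

Anything other than `ok` would be a reason to move to the next tier.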

Tier 2: Proxy Rotation (Standard)

Tools: Bright Data residential/datacenter proxies
Cost: ~$0.10-0.60 per GB depending on proxy zone type

# Datacenter proxy (cheapest, lower success rate)
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-datacenter:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://target-site.com/page"

# Residential proxy (more expensive, higher success rate)
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-residential:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://target-site.com/page"

# With country targeting
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-residential-country-us:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://target-site.com/page"

Escalate to Tier 3 when: Proxy requests still return bot detection, site requires JavaScript rendering, or content loads dynamically.

Tier 3: Browser Rendering (Premium)

Tools: Bright Data Scraping Browser or Web Unlocker
Cost: ~$1.00-3.00 per 1K requests

# Web Unlocker API (handles CAPTCHAs, fingerprinting, rendering)
curl -x "http://brd-customer-{CUSTOMER_ID}-zone-unlocker:{PASSWORD}@brd.superproxy.io:22225" \
  -s "https://heavily-protected-site.com/page"

# SERP API (specialized for search engines)
curl "https://api.brightdata.com/serp/req?customer={CUSTOMER_ID}&zone=serp" \
  -H "Authorization: Bearer ${BRIGHTDATA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"query": "search terms", "search_engine": "google", "country": "us"}'

Cost Comparison Table

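| Tier | Method | Cost per 1K Requests | Success Rate | Speed |
|------|--------|----------------------|--------------|-------|
| 1 | Direct curl/WebFetch | Free | 40-60% | Fastest |
| 2a | Datacenter proxy | ~$0.10 | 60-75% | Fast |
| 2b | Residential proxy | ~$0.50 | 80-90% | Medium |
| 3a | Web Unlocker | ~$2.00 | 95-99% | Slower |
| 3b | SERP API | ~$3.00 | 99%+ | Slowest |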
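To make the cost column concrete, the arithmetic is rate × requests ÷ 1,000. The sketch below wraps that in a small helper. `cost_usd` is a hypothetical name, and the request counts in the example are assumed purely for illustration (they are not measured data).

```shell
# cost_usd RATE_PER_1K REQUESTS
# Prints the cost in USD, to two decimals, of sending REQUESTS requests
# at RATE_PER_1K dollars per 1,000 requests.
cost_usd() {
  awk -v rate="$1" -v n="$2" 'BEGIN { printf "%.2f\n", rate * n / 1000 }'
}

# Sending 10,000 URLs straight to Web Unlocker (~$2.00 per 1K):
#   cost_usd 2.00 10000   # prints 20.00
#
# Escalating instead, with ASSUMED counts reaching each paid tier:
#   cost_usd 0.10 5000    # prints 0.50  (5,000 URLs retried via datacenter proxy)
#   cost_usd 0.50 1500    # prints 0.75  (1,500 retried via residential proxy)
#   cost_usd 2.00 225     # prints 0.45  (225 retried via Web Unlocker)
```

Under those assumed counts the escalated run totals about $1.70 against $20.00, which is the motivation for starting at the cheapest tier.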

Proxy Zone Configuration

Zones are configured in the Bright Data dashboard (brightdata.com/cp). Each zone has:
- Zone Name: Identifier used in proxy URL (e.g., datacenter, residential, unlocker)
- Proxy Type: Datacenter, ISP, Residential, or Mobile
- Country Targeting: Append -country-{code} to zone name
- Session Management: Add -session-{id} for sticky sessions (same IP across requests)
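The zone, country, and session suffixes compose into a single proxy username. A minimal sketch of that composition follows; `brd_proxy_user` is a hypothetical helper, and placing the country suffix before the session suffix is an assumption based on the country-targeting example in this skill.

```shell
# brd_proxy_user CUSTOMER_ID ZONE [COUNTRY] [SESSION_ID]
# Prints a proxy username of the form
#   brd-customer-<id>-zone-<zone>[-country-<code>][-session-<id>]
# Pass an empty string for COUNTRY to set only a session.
brd_proxy_user() {
  printf 'brd-customer-%s-zone-%s' "$1" "$2"
  if [ -n "${3:-}" ]; then
    printf '%s' "-country-$3"
  fi
  if [ -n "${4:-}" ]; then
    printf '%s' "-session-$4"
  fi
  printf '\n'
}

# Example:
#   brd_proxy_user abc123 residential us s42
#   # prints brd-customer-abc123-zone-residential-country-us-session-s42
```

Reusing the same session ID across requests is what keeps the same IP; changing it asks for a new one.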

SERP API Usage

For search engine results specifically, use the SERP API instead of general scraping:

curl "https://api.brightdata.com/serp/req" \
  -H "Authorization: Bearer ${BRIGHTDATA_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "best coffee shops in austin",
    "search_engine": "google",
    "country": "us",
    "num_results": 20,
    "parse": true
  }'

The parse: true flag returns structured JSON with title, URL, snippet, and position for each result.

Detailed Process

  1. Assess the Target -- Check the URL. Is it a public page? Does it require JS? Is it a search engine?
  2. Start at Tier 1 -- Try direct fetch with curl or WebFetch. Inspect the response for real content.
  3. Evaluate Response -- Check for: 403/429 status, CAPTCHA HTML, empty body, bot detection messages.
  4. Escalate if Needed -- Move to Tier 2 (proxy) or Tier 3 (browser/unlocker) based on failure type.
  5. Extract Content -- Parse the successful HTML response for the target data.
  6. Return Structured Output -- Format extracted data as JSON matching the parser skill schema.
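Steps 2 through 4 amount to a loop over tiers that stops at the first success. The sketch below shows one way to express that loop, assuming a caller-supplied fetch command; `scrape_progressive` and the fetch command's calling convention are hypothetical, not part of Bright Data or this skill.

```shell
# scrape_progressive FETCH_CMD URL
# FETCH_CMD is a hypothetical caller-supplied command or function invoked as
#   FETCH_CMD TIER URL
# It should print the page to stdout and exit 0 only when the response
# holds real content, and print nothing and exit non-zero otherwise.
scrape_progressive() {
  for tier in 1 2a 2b 3a; do
    if "$1" "$tier" "$2"; then
      echo "succeeded at tier $tier" >&2
      return 0
    fi
    echo "tier $tier failed, escalating" >&2
  done
  echo "all tiers failed for $2" >&2
  return 1
}

# Example with a stand-in fetcher that only succeeds at Tier 2b:
#   demo_fetch() { [ "$1" = "2b" ] && echo "<html>page</html>"; }
#   scrape_progressive demo_fetch "https://example.com/page"
```

Progress messages go to stderr so that stdout carries only the fetched page.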

When to Escalate Between Tiers

| Signal | Current Tier | Action |
|--------|--------------|--------|
| HTTP 200 with real content | Any | Success -- do not escalate |
| HTTP 403 or 429 | Tier 1 | Escalate to Tier 2 (datacenter proxy) |
| Bot detection page | Tier 2a | Escalate to Tier 2b (residential proxy) |
| CAPTCHA challenge | Tier 2b | Escalate to Tier 3 (Web Unlocker) |
| JavaScript-rendered content | Tier 1 or 2 | Escalate to Tier 3 (browser rendering) |
| Search engine results | Any | Use SERP API directly |
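The escalation rules above map naturally onto a lookup. Here is a minimal sketch as a shell `case` statement; the signal names (`ok`, `http_block`, `bot_page`, `captcha`, `js`, `serp`) and the function name `next_action` are hypothetical labels chosen for this example, and combinations the rules do not cover are reported as `unspecified` rather than guessed.

```shell
# next_action SIGNAL CURRENT_TIER
# Prints the next step: done, 2a, 2b, 3, serp, or unspecified.
next_action() {
  case "$1:$2" in
    ok:*)              echo "done" ;;        # HTTP 200 with real content
    serp:*)            echo "serp" ;;        # search engine results
    js:1|js:2a|js:2b)  echo "3" ;;           # JavaScript-rendered content
    http_block:1)      echo "2a" ;;          # HTTP 403 or 429
    bot_page:2a)       echo "2b" ;;          # bot detection page
    captcha:2b)        echo "3" ;;           # CAPTCHA challenge
    *)                 echo "unspecified" ;; # not covered by the rules
  esac
}

# Example:
#   next_action http_block 1   # prints 2a
#   next_action captcha 2b     # prints 3
```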

When to Use

  • User provides a URL and asks to "scrape it", "get the content", "extract data from this site"
  • WebFetch returns blocked/empty content and escalation is needed
  • Bulk URL scraping where some sites have anti-bot protection
  • Search engine result collection (use SERP API path directly)
  • Price monitoring, competitive analysis, or market research data collection

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
