# web-scraper-builder

# Install this skill:
npx skills add JackZhu426/web-scraper-crawlee-skill

Or install this specific skill: npx add-skill https://github.com/JackZhu426/web-scraper-crawlee-skill

# Description

Build production-ready web scrapers using Crawlee and Playwright through guided website analysis, selector discovery, and crawler code generation.

# SKILL.md


---
name: web-scraper-builder
description: |
  Build production-ready web scrapers using Crawlee and Playwright. This skill guides you through analyzing websites, identifying selectors, and generating crawler code that follows best practices.

  Use when:
  - User wants to "scrape a website", "extract data from a page", or "build a crawler"
  - User provides a URL and asks to extract product/deal/content information
  - User needs help finding CSS selectors for web elements
  - User wants to create an Apify actor for data extraction
  - User asks about "web scraping", "DOM analysis", or "Playwright selectors"

  Triggers: "scrape", "crawler", "extract from website", "web scraping", "find selectors", "Apify actor", "Playwright", "Crawlee"
---

Web Scraper Builder

Build production-ready Crawlee + Playwright crawlers through guided analysis and code generation.

Workflow

Phase 1: Gather Requirements

Ask the user:

  1. Target URL: "What's the URL of the page you want to scrape?"
  2. Data fields: "What data do you need to extract? (e.g., product title, price, description, image, availability)"
  3. Page type: "Is this a listing page (multiple items) or a detail page (single item)?"
  4. Analysis method: "Would you like me to analyze the page live using browser tools, or would you prefer to provide HTML snippets?"

Phase 2: Analyze Page Structure

Option A: Live Analysis (Recommended)
Use mcp__claude-in-chrome__* or mcp__chrome-devtools__* tools to:
1. Navigate to the target URL
2. Take a screenshot for visual reference
3. Analyze the DOM structure
4. Identify data-testid attributes, semantic elements, and class patterns

Option B: User-Provided HTML
If the user provides HTML snippets:
1. Analyze the provided markup
2. Identify stable selector patterns
3. Suggest multiple fallback strategies

Phase 3: Build Selectors

Follow these selector priority rules (most stable → least stable):

// Priority 1: data-testid (most stable)
page.locator('[data-testid="product-title"]')

// Priority 2: Semantic/ARIA
page.getByRole('button', { name: /add to cart/i })
page.getByLabel('Search')

// Priority 3: Partial class match (resilient to hash suffixes)
page.locator('[class*="product-card"]')
page.locator('div[class*="price"]')

// Priority 4: Attribute selectors
page.locator('img[src*="cdn"]')
page.locator('a[href*="/product/"]')

// Priority 5: Fallback chains using .or()
page.locator('[data-testid="price"]')
  .or(page.locator('[class*="price"]'))
  .or(page.locator('.price'))

Always avoid:
- Exact class names with hashes: .product-card-abc123
- Deep nth-child chains: div:nth-child(3) > div:nth-child(2)
- Overly generic selectors: span, div

See references/selector-patterns.md for comprehensive selector strategies.

Phase 4: Generate Crawler Code

Use the template at scripts/crawler-template.ts as a starting point. Customize:

  1. Input interface - Define expected inputs (URLs, limits, flags)
  2. Router handlers - Create handlers for listing vs detail pages (see the sketch after this list)
  3. Extraction logic - Implement field extraction with fallbacks
  4. Error handling - Wrap extractions in try-catch blocks
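
A minimal sketch of items 2-4, using Crawlee's createPlaywrightRouter (the 'LIST'/'DETAIL' labels, the Input fields, and the selectors below are illustrative assumptions, not the template's actual contents):

import { createPlaywrightRouter, Dataset } from 'crawlee';

// Hypothetical input shape collected in Phase 1.
export interface Input {
  startUrls: string[];
  maxItems?: number;
}

export const router = createPlaywrightRouter();

// Listing pages: enqueue product detail pages for the DETAIL handler.
router.addHandler('LIST', async ({ page, enqueueLinks, log }) => {
  log.info(`Listing page: ${page.url()}`);
  await enqueueLinks({ selector: 'a[href*="/product/"]', label: 'DETAIL' });
});

// Detail pages: extract fields with fallbacks, wrapped in try-catch.
router.addHandler('DETAIL', async ({ page, log }) => {
  try {
    const title = await page
      .locator('[data-testid="product-title"]')
      .or(page.locator('h1[class*="title"]'))
      .first()
      .textContent();
    const price = await page.locator('[class*="price"]').first().textContent();
    await Dataset.pushData({ url: page.url(), title: title?.trim(), price: price?.trim() });
  } catch (error) {
    log.error(`Extraction failed on ${page.url()}: ${(error as Error).message}`);
  }
});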

Phase 5: Test & Iterate

  1. Run the crawler locally: pnpm start:dev or apify run --purge
  2. Check extracted data in storage/datasets/default/ (or programmatically, as sketched below)
  3. Refine selectors based on edge cases
  4. Add missing fallbacks for failed extractions
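
To spot-check the results programmatically rather than opening the JSON files by hand, a minimal sketch using Crawlee's Dataset API (assumes an ESM project so top-level await is available):

import { Dataset } from 'crawlee';

// Opens the default dataset written by the last run (storage/datasets/default/)
// and prints a small sample for manual review.
const dataset = await Dataset.open();
const { items } = await dataset.getData({ limit: 5 });
console.log(JSON.stringify(items, null, 2));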

Key Patterns

Always use page.locator(), never document.querySelector()

// ✅ Correct - Auto-waits, better errors
const title = await page.locator('h1[class*="title"]').textContent();

// ❌ Wrong - No auto-waiting, fragile
const title = await page.evaluate(() =>
  document.querySelector('h1.title')?.textContent
);

Batch extraction (for listing pages)

const products = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('[data-testid="product-card"]'))
    .map(card => ({
      name: card.querySelector('[data-testid="title"]')?.textContent?.trim(),
      price: card.querySelector('[data-testid="price"]')?.textContent?.trim(),
      url: card.querySelector('a')?.getAttribute('href'),
    }))
    .filter((product) => product.name); // drop cards where no title was found
});
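
If the listing items link out to detail pages, a sketch of handing those URLs to a detail handler (the 'DETAIL' label is a hypothetical assumption; enqueueLinks comes from the Crawlee handler context, and relative hrefs are resolved against the current page URL):

// Resolve relative hrefs and enqueue them for the DETAIL handler.
const detailUrls = products
  .map((p) => p.url)
  .filter((href): href is string => Boolean(href))
  .map((href) => new URL(href, page.url()).toString());

await enqueueLinks({ urls: detailUrls, label: 'DETAIL' });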

Waiting strategies

// Wait for element
await page.locator('[data-testid="products"]').waitFor({ timeout: 30000 });

// Wait for network idle (after clicking "Load More")
await page.waitForLoadState('networkidle', { timeout: 15000 });

// Wait for count to increase
await page.waitForFunction(
  (min) => document.querySelectorAll('.product').length > min,
  currentCount,
  { timeout: 15000 }
);

Handling popups/modals

const cookieBtn = page.locator('button:has-text("Accept")').first();
if (await cookieBtn.isVisible({ timeout: 3000 }).catch(() => false)) {
  await cookieBtn.click();
}
await page.keyboard.press('Escape'); // Fallback

Resources

  • scripts/crawler-template.ts - Complete Crawlee + Playwright boilerplate
  • references/selector-patterns.md - Comprehensive selector strategy guide
  • references/common-ecommerce-patterns.md - Patterns for product pages, deals, pagination

Output

Deliver:
1. Selectors document - List of identified selectors with confidence ratings (example sketched below)
2. Crawler code - Complete TypeScript file using Crawlee + Playwright
3. Testing instructions - How to run and validate the crawler
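
As an illustration, a selectors document entry might look like this (the selectors and ratings are purely hypothetical):

// Hypothetical excerpt of a selectors document for a product listing page.
const selectors = {
  productCard: { css: '[data-testid="product-card"]', confidence: 'high' },
  title: { css: '[data-testid="title"], [class*="product-title"]', confidence: 'high' },
  price: { css: '[data-testid="price"], [class*="price"]', confidence: 'medium' },
  image: { css: 'img[src*="cdn"]', confidence: 'low' },
} as const;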

# README.md

Web Scraper Builder Skill

A Claude Code skill for building production-ready web scrapers using Crawlee and Playwright.

What This Skill Does

Guides you through the complete web scraping workflow:

  1. Gather Requirements - Asks about target URL, data fields, and page type
  2. Analyze Page Structure - Uses browser automation or user-provided HTML
  3. Build Selectors - Creates resilient CSS selectors with fallback chains
  4. Generate Crawler Code - Produces complete TypeScript code using Crawlee + Playwright
  5. Test & Iterate - Guides local testing and refinement

Features

  • Live DOM Analysis - Uses Claude in Chrome or Chrome DevTools MCP tools
  • Resilient Selectors - Priority-based strategy (data-testid → ARIA → class* → attributes)
  • Fallback Chains - Automatic .or() fallback patterns for reliability
  • Production Templates - Complete Crawlee + Playwright boilerplate
  • E-commerce Patterns - Pre-built patterns for products, deals, pagination

Installation

Claude Code discovers skills by scanning for SKILL.md files in specific directories.

Option 1: Clone into the skills directory

# Personal scope (all your projects)
cd ~/.claude/skills
git clone https://github.com/JackZhu426/web-scraper-crawlee-skill.git

# Or project scope (current project only)
cd your-project/.claude/skills
git clone https://github.com/JackZhu426/web-scraper-crawlee-skill.git

Option 2: Download and copy

# Download/clone anywhere
git clone https://github.com/JackZhu426/web-scraper-crawlee-skill.git

# Copy to Claude skills directory
cp -r web-scraper-crawlee-skill ~/.claude/skills/web-scraper-builder

Option 3: Extract from .skill file

Download web-scraper-builder.skill from releases, then:

cd ~/.claude/skills
unzip web-scraper-builder.skill

Verify Installation

The skill should appear when you run /web-scraper-builder in Claude Code.

Usage

Once installed, trigger the skill by asking Claude Code:

  • "I want to scrape products from this website"
  • "Help me build a crawler for [URL]"
  • "Find selectors for extracting prices from this page"
  • "Create an Apify actor for data extraction"

Or invoke directly: /web-scraper-builder

Included Resources

| File | Description |
|------|-------------|
| SKILL.md | Core skill instructions |
| scripts/crawler-template.ts | Complete Crawlee + Playwright boilerplate |
| references/selector-patterns.md | Comprehensive selector strategy guide |
| references/common-ecommerce-patterns.md | E-commerce extraction patterns |

Requirements

  • Claude Code CLI
  • Node.js 18+ (for running generated crawlers)
  • pnpm (recommended) or npm

License

MIT License - see LICENSE

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.