# web-scraper-builder

# Install this skill:
npx skills add JackZhu426/web-scraper-crawlee-skill

Or install this specific skill: npx add-skill https://github.com/JackZhu426/web-scraper-crawlee-skill

# Description

Build production-ready web scrapers using Crawlee and Playwright through guided website analysis, selector discovery, and crawler code generation.

# SKILL.md


---
name: web-scraper-builder
description: |
  Build production-ready web scrapers using Crawlee and Playwright. This skill guides you through analyzing websites, identifying selectors, and generating crawler code that follows best practices.

  Use when:
  - User wants to "scrape a website", "extract data from a page", or "build a crawler"
  - User provides a URL and asks to extract product/deal/content information
  - User needs help finding CSS selectors for web elements
  - User wants to create an Apify actor for data extraction
  - User asks about "web scraping", "DOM analysis", or "Playwright selectors"

  Triggers: "scrape", "crawler", "extract from website", "web scraping", "find selectors", "Apify actor", "Playwright", "Crawlee"
---

Web Scraper Builder

Build production-ready Crawlee + Playwright crawlers through guided analysis and code generation.

Workflow

Phase 1: Gather Requirements

Ask the user:

  1. Target URL: "What's the URL of the page you want to scrape?"
  2. Data fields: "What data do you need to extract? (e.g., product title, price, description, image, availability)"
  3. Page type: "Is this a listing page (multiple items) or a detail page (single item)?"
  4. Analysis method: "Would you like me to analyze the page live using browser tools, or would you prefer to provide HTML snippets?"

Phase 2: Analyze Page Structure

Option A: Live Analysis (Recommended)
Use mcp__claude-in-chrome__* or mcp__chrome-devtools__* tools to:
1. Navigate to the target URL
2. Take a screenshot for visual reference
3. Analyze the DOM structure
4. Identify data-testid attributes, semantic elements, and class patterns

Option B: User-Provided HTML
If the user provides HTML snippets:
1. Analyze the provided markup
2. Identify stable selector patterns
3. Suggest multiple fallback strategies

Phase 3: Build Selectors

Follow these selector priority rules (most stable → least stable):

// Priority 1: data-testid (most stable)
page.locator('[data-testid="product-title"]')

// Priority 2: Semantic/ARIA
page.getByRole('button', { name: /add to cart/i })
page.getByLabel('Search')

// Priority 3: Partial class match (resilient to hash suffixes)
page.locator('[class*="product-card"]')
page.locator('div[class*="price"]')

// Priority 4: Attribute selectors
page.locator('img[src*="cdn"]')
page.locator('a[href*="/product/"]')

// Priority 5: Fallback chains using .or()
page.locator('[data-testid="price"]')
  .or(page.locator('[class*="price"]'))
  .or(page.locator('.price'))

Always avoid:
- Exact class names with hashes: .product-card-abc123
- Deep nth-child chains: div:nth-child(3) > div:nth-child(2)
- Overly generic selectors: span, div

See references/selector-patterns.md for comprehensive selector strategies.

Phase 4: Generate Crawler Code

Use the template at scripts/crawler-template.ts as a starting point. Customize:

  1. Input interface - Define expected inputs (URLs, limits, flags)
  2. Router handlers - Create handlers for listing vs detail pages (see the sketch after this list)
  3. Extraction logic - Implement field extraction with fallbacks
  4. Error handling - Wrap extractions in try-catch blocks
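
A minimal sketch of items 2-4, using Crawlee's createPlaywrightRouter (the 'LIST'/'DETAIL' labels, the Input fields, and the selectors below are illustrative assumptions, not the template's actual contents):

import { createPlaywrightRouter, Dataset } from 'crawlee';

// Hypothetical input shape collected in Phase 1.
export interface Input {
  startUrls: string[];
  maxItems?: number;
}

export const router = createPlaywrightRouter();

// Listing pages: enqueue product detail pages for the DETAIL handler.
router.addHandler('LIST', async ({ page, enqueueLinks, log }) => {
  log.info(`Listing page: ${page.url()}`);
  await enqueueLinks({ selector: 'a[href*="/product/"]', label: 'DETAIL' });
});

// Detail pages: extract fields with fallbacks, wrapped in try-catch.
router.addHandler('DETAIL', async ({ page, log }) => {
  try {
    const title = await page
      .locator('[data-testid="product-title"]')
      .or(page.locator('h1[class*="title"]'))
      .first()
      .textContent();
    const price = await page.locator('[class*="price"]').first().textContent();
    await Dataset.pushData({ url: page.url(), title: title?.trim(), price: price?.trim() });
  } catch (error) {
    log.error(`Extraction failed on ${page.url()}: ${(error as Error).message}`);
  }
});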

Phase 5: Test & Iterate

  1. Run the crawler locally: pnpm start:dev or apify run --purge
  2. Check extracted data in storage/datasets/default/ (or programmatically, as sketched below)
  3. Refine selectors based on edge cases
  4. Add missing fallbacks for failed extractions
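
To spot-check the results programmatically rather than opening the JSON files by hand, a minimal sketch using Crawlee's Dataset API (assumes an ESM project so top-level await is available):

import { Dataset } from 'crawlee';

// Opens the default dataset written by the last run (storage/datasets/default/)
// and prints a small sample for manual review.
const dataset = await Dataset.open();
const { items } = await dataset.getData({ limit: 5 });
console.log(JSON.stringify(items, null, 2));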

Key Patterns

Always use page.locator(), never document.querySelector()

// ✅ Correct - Auto-waits, better errors
const title = await page.locator('h1[class*="title"]').textContent();

// ❌ Wrong - No auto-waiting, fragile
const title = await page.evaluate(() =>
  document.querySelector('h1.title')?.textContent
);

Batch extraction (for listing pages)

const products = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('[data-testid="product-card"]'))
    .map(card => ({
      name: card.querySelector('[data-testid="title"]')?.textContent?.trim(),
      price: card.querySelector('[data-testid="price"]')?.textContent?.trim(),
      url: card.querySelector('a')?.getAttribute('href'),
    }))
    .filter((product) => product.name); // drop cards where no title was found
});
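
If the listing items link out to detail pages, a sketch of handing those URLs to a detail handler (the 'DETAIL' label is a hypothetical assumption; enqueueLinks comes from the Crawlee handler context, and relative hrefs are resolved against the current page URL):

// Resolve relative hrefs and enqueue them for the DETAIL handler.
const detailUrls = products
  .map((p) => p.url)
  .filter((href): href is string => Boolean(href))
  .map((href) => new URL(href, page.url()).toString());

await enqueueLinks({ urls: detailUrls, label: 'DETAIL' });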

Waiting strategies

// Wait for element
await page.locator('[data-testid="products"]').waitFor({ timeout: 30000 });

// Wait for network idle (after clicking "Load More")
await page.waitForLoadState('networkidle', { timeout: 15000 });

// Wait for count to increase
await page.waitForFunction(
  (min) => document.querySelectorAll('.product').length > min,
  currentCount,
  { timeout: 15000 }
);

Handling popups/modals

const cookieBtn = page.locator('button:has-text("Accept")').first();
if (await cookieBtn.isVisible({ timeout: 3000 }).catch(() => false)) {
  await cookieBtn.click();
}
await page.keyboard.press('Escape'); // Fallback

Resources

  • scripts/crawler-template.ts - Complete Crawlee + Playwright boilerplate
  • references/selector-patterns.md - Comprehensive selector strategy guide
  • references/common-ecommerce-patterns.md - Patterns for product pages, deals, pagination

Output

Deliver:
1. Selectors document - List of identified selectors with confidence ratings (example sketched below)
2. Crawler code - Complete TypeScript file using Crawlee + Playwright
3. Testing instructions - How to run and validate the crawler
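
As an illustration, a selectors document entry might look like this (the selectors and ratings are purely hypothetical):

// Hypothetical excerpt of a selectors document for a product listing page.
const selectors = {
  productCard: { css: '[data-testid="product-card"]', confidence: 'high' },
  title: { css: '[data-testid="title"], [class*="product-title"]', confidence: 'high' },
  price: { css: '[data-testid="price"], [class*="price"]', confidence: 'medium' },
  image: { css: 'img[src*="cdn"]', confidence: 'low' },
} as const;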

# README.md

Web Scraper Builder Skill

A Claude Code skill for building production-ready web scrapers using Crawlee and Playwright.

What This Skill Does

Guides you through the complete web scraping workflow:

  1. Gather Requirements - Asks about target URL, data fields, and page type
  2. Analyze Page Structure - Uses browser automation or user-provided HTML
  3. Build Selectors - Creates resilient CSS selectors with fallback chains
  4. Generate Crawler Code - Produces complete TypeScript code using Crawlee + Playwright
  5. Test & Iterate - Guides local testing and refinement

Features

  • Live DOM Analysis - Uses Claude in Chrome or Chrome DevTools MCP tools
  • Resilient Selectors - Priority-based strategy (data-testid → ARIA → class* → attributes)
  • Fallback Chains - Automatic .or() fallback patterns for reliability
  • Production Templates - Complete Crawlee + Playwright boilerplate
  • E-commerce Patterns - Pre-built patterns for products, deals, pagination

Installation

Claude Code discovers skills by scanning for SKILL.md files in specific directories.

Option 1: Clone into the skills directory

# Personal scope (all your projects)
cd ~/.claude/skills
git clone https://github.com/JackZhu426/web-scraper-crawlee-skill.git

# Or project scope (current project only)
cd your-project/.claude/skills
git clone https://github.com/JackZhu426/web-scraper-crawlee-skill.git

Option 2: Download and copy

# Download/clone anywhere
git clone https://github.com/JackZhu426/web-scraper-crawlee-skill.git

# Copy to Claude skills directory
cp -r web-scraper-crawlee-skill ~/.claude/skills/web-scraper-builder

Option 3: Extract from .skill file

Download web-scraper-builder.skill from releases, then:

cd ~/.claude/skills
unzip web-scraper-builder.skill

Verify Installation

The skill should appear when you run /web-scraper-builder in Claude Code.

Usage

Once installed, trigger the skill by asking Claude Code:

  • "I want to scrape products from this website"
  • "Help me build a crawler for [URL]"
  • "Find selectors for extracting prices from this page"
  • "Create an Apify actor for data extraction"

Or invoke directly: /web-scraper-builder

Included Resources

| File | Description |
|------|-------------|
| SKILL.md | Core skill instructions |
| scripts/crawler-template.ts | Complete Crawlee + Playwright boilerplate |
| references/selector-patterns.md | Comprehensive selector strategy guide |
| references/common-ecommerce-patterns.md | E-commerce extraction patterns |

Requirements

  • Claude Code CLI
  • Node.js 18+ (for running generated crawlers)
  • pnpm (recommended) or npm

License

MIT License - see LICENSE

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.