site-harvest

by @WebSmartTeam in Web & API

# Install this skill:

npx skills add WebSmartTeam/COR-CODE --skill "site-harvest"

Install specific skill from multi-skill repository

# Description

Extract complete website content, design system, and assets for rebuilding or migration. Uses Firecrawl for content/CSS extraction, Chrome for visual comparison. Generates theme skill file for rebuild. Triggers: harvest site, scrape website, extract design, clone website, migrate site, copy website design, grab design tokens.

# SKILL.md

name: site-harvest
description: Extract complete website content, design system, and assets for rebuilding or migration. Uses Firecrawl for content/CSS extraction, Chrome for visual comparison. Generates theme skill file for rebuild. Triggers: harvest site, scrape website, extract design, clone website, migrate site, copy website design, grab design tokens.
updated: 2025-01-18
user-invocable: true
allowed-tools:
- Read
- Write
- Edit
- Bash
- Glob
- Grep
- TodoWrite
- WebFetch
- mcp__firecrawl__firecrawl_scrape
- mcp__firecrawl__firecrawl_map
- mcp__firecrawl__firecrawl_crawl
- mcp__firecrawl__firecrawl_extract

Site Harvest Skill

Purpose: Extract complete website content, design system, and assets for rebuilding or migration.

Trigger: /site-harvest [url] or "harvest [url]"

Architecture: Token-Efficient Hybrid

Task	Tool	Why
URL Discovery	Firecrawl `map`	Their tokens, instant, free 500 pages
Bulk Content	Firecrawl `crawl`	Their tokens, parallel extraction
Structured Data	Firecrawl `extract`	Schema-based, accurate
Sitemap Analysis	Claude + WebFetch	Quick XML parse, freshness check
Screaming Frog	CSV import	Pre-crawled, comprehensive
Design Screenshots	Firecrawl `scrape`	Branding format extracts design
Side-by-Side Compare	Chrome Tab	Only for old vs new comparison

Token Strategy: Firecrawl does heavy lifting (their tokens) → Chrome only for visual comparison (our tokens)

Prerequisites

Firecrawl MCP - Primary extraction engine
Chrome Tab MCP - Only for side-by-side comparison phase
URL sources (one or more):
Site URL (Firecrawl discovers pages)
Screaming Frog export (CSV)
sitemap.xml (auto-detected)

Quick Start

/site-harvest https://example.com

9-Phase Workflow

Phase 1: URL Discovery & Site Structure (FOUNDATION)

⚠️ THIS IS THE BASIS FOR EVERYTHING. Get this wrong = rebuild has gaps.

Ask for Screaming Frog export first (most comprehensive)
Analyse sitemap.xml via WebFetch (freshness, coverage)
Run Firecrawl map (live discoverable pages)
Check AJAX/Pagination (load more, infinite scroll, WordPress API)
Four-way comparison: Sitemap vs Firecrawl vs SF vs AJAX
Document site structure (page types, navigation, hierarchy)
Report findings and wait for confirmation

Detailed instructions: See references/url-discovery.md

Save: /url-discovery.json

Phase 2: Content Extraction (Firecrawl)

firecrawl_crawl({
  url: "https://example.com",
  limit: [merged URL count],
  scrapeOptions: {
    formats: ["markdown", "html", "links"],
    onlyMainContent: true
  }
})

For each page, save:
- /pages/[slug].md (clean markdown)
- /pages/[slug].json (structured: title, meta, headings)

Extract media references (images, videos, documents).

Save: /content-manifest.json

Phase 3: Design System Extraction

// Branding extraction
firecrawl_scrape({
  url: "https://example.com",
  formats: ["branding"]
})

// Full CSS capture
firecrawl_scrape({
  url: "https://example.com",
  formats: ["html", "rawHtml"]
})

Parse all stylesheet URLs and download CSS files
Extract CSS variables (--color-, --font-, --spacing-*)
Capture typography scale (h1-h6, p, small)

Save: /design-tokens.json

Phase 4: Component Style Catalogue (EXHAUSTIVE)

Capture EVERY visual pattern - this prevents "footer links unstyled" problems.

Extract computed styles for:
- Navigation (header, links, mobile menu, dropdowns)
- Footer (container, columns, links, social icons)
- Typography (headings, paragraphs, lists, blockquotes)
- Buttons & CTAs (primary, secondary, ghost, hover states)
- Sections (padding, backgrounds, alternating patterns)
- Dividers (hr, borders, SVG waves, clip-path angles)
- Cards (container, hover, image, content)
- Icons (download exact SVGs, don't substitute!)
- Forms (inputs, labels, error states)

Detailed element list: See references/component-styles.md

Save: /component-styles.json

Phase 5: Visual Capture (Screenshots)

firecrawl_scrape({
  url: "https://example.com",
  formats: [
    { type: "screenshot", fullPage: true, viewport: { width: 1920, height: 1080 } },
    { type: "screenshot", fullPage: true, viewport: { width: 768, height: 1024 } },
    { type: "screenshot", fullPage: true, viewport: { width: 375, height: 812 } }
  ]
})

Screenshot key pages (homepage, about, services, blog, contact) and components (header, footer, hero, cards, dividers).

Save to: /screenshots/

Phase 6: Asset Download

Images: Download all, maintain folder structure → /media/
Fonts: Parse @font-face, download all formats → /assets/fonts/
Icons: Extract inline SVGs exactly, download external SVGs → /assets/icons/
JavaScript: Download external JS, note inline scripts → /assets/scripts/

Phase 7: Theme Skill Generation

Generate: /[project-name]-theme.md

Document:
1. Brand identity (colours with hex + Tailwind classes)
2. Typography (fonts, sizes, weights)
3. Section patterns (padding, backgrounds, dividers)
4. Component specs (buttons, cards, links)
5. Layout patterns (grids, flexbox)
6. Special elements (wave SVGs, icons)
7. Tailwind classes reference

Template: See references/theme-generation.md

This is the single source of truth during rebuild.

Phase 8: Manifest Generation

Generate comprehensive manifest.json with:
- Harvest metadata (URL, date, tool version)
- URL discovery results (all sources compared)
- Pages list with files
- Design assets references
- Screenshots index
- Warnings and flags

Example manifest: See references/output-structure.md

Phase 9: Side-by-Side Comparison (Chrome Tab)

Only runs when BOTH old and new sites exist.

Open both sites in Chrome tabs
Screenshot both at same viewport
Compare: header, hero, sections, footer, dividers, icons
Flag mismatches with specifics
Generate comparison report

Detailed workflow: See references/rebuild-workflow.md

Critical Rules

❌ DON'T

Substitute icons with similar ones from icon libraries
Ignore wave/angle dividers
Skip footer link styling
Assume section spacing without measuring
Miss hover/active/focus states

✅ DO

Extract EXACT SVG markup for all custom icons
Capture ALL divider types (hr, border, SVG, clip-path)
Document EVERY link style (nav, footer, inline, CTA)
Measure actual padding/margin values
Screenshot unusual patterns for reference

Error Handling

Error	Action
Firecrawl rate limit	Wait, retry with smaller batch
sitemap.xml missing	Continue with Firecrawl + Screaming Frog
CSS file 404	Log warning, check for inline styles
Font file blocked	Note in manifest, may need manual download
SVG divider complex	Screenshot + extract raw HTML

Example Usage

# Basic harvest
/site-harvest https://client-site.co.uk

# With Screaming Frog export
/site-harvest https://example.com --urls screaming-frog-export.csv

# Comparison mode (after rebuild)
/site-harvest compare https://old-site.com https://new-site.vercel.app

Output Structure

See references/output-structure.md for complete folder layout and manifest example.

/scraped-data/[site-name]/
├── manifest.json
├── [site-name]-theme.md
├── url-discovery.json
├── design-tokens.json
├── component-styles.json
├── pages/
├── screenshots/
├── media/
├── assets/
└── comparison/

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.