webpage-to-md

by @wangminle in Web & API

# Install this skill:

npx skills add wangminle/skills-webpage-to-md --skill "webpage-to-md"

The `--skill` flag installs one specific skill from a multi-skill repository.

# Description

Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format.

# SKILL.md

name: webpage-to-md
description: "Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format."

Web to Markdown Grabber

Extract web content and convert to clean Markdown with local images.

Script Location

This skill includes a Python script at scripts/grab_web_to_md.py.

When using this skill, replace SKILL_DIR with the actual skill installation path:
- Claude Code: ~/.claude/skills/webpage-to-md/
- Cursor: ~/.cursor/skills/webpage-to-md/ (if installed there)
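
If you would rather resolve SKILL_DIR programmatically than type it by hand, a minimal sketch could probe the two locations listed above. The helper name `find_skill_dir` is hypothetical and not part of the skill; it only checks that `scripts/grab_web_to_md.py` exists under each candidate.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate install locations, taken from the list above.
CANDIDATE_DIRS = (
    Path.home() / ".claude" / "skills" / "webpage-to-md",
    Path.home() / ".cursor" / "skills" / "webpage-to-md",
)


def find_skill_dir(candidates: Iterable[Path] = CANDIDATE_DIRS) -> Optional[Path]:
    """Return the first candidate containing scripts/grab_web_to_md.py, else None."""
    for base in candidates:
        if (base / "scripts" / "grab_web_to_md.py").is_file():
            return base
    return None
```

If it returns `None`, the skill is installed somewhere else and the path has to be supplied manually.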

Quick Start

# Single page export
python SKILL_DIR/scripts/grab_web_to_md.py "https://example.com/article" --out output.md --validate

# WeChat article (auto-detected)
python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" --out article.md

# Wiki batch crawl + merge
python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl --crawl-pattern 'page=' \
  --merge --toc --merge-output wiki.md

Core Parameters

| Parameter | Purpose | Example |
|-----------|---------|---------|
| `--out` | Output file path | `--out docs/article.md` |
| `--validate` | Verify image integrity | `--validate` |
| `--keep-html` | Preserve complex tables | `--keep-html` |
| `--tags` | Add YAML frontmatter tags | `--tags "ai,tutorial"` |

Three Main Use Cases

1. Single Page Export (Blog/News)

python SKILL_DIR/scripts/grab_web_to_md.py "URL" \
  --out output.md \
  --keep-html \
  --tags "topic1,topic2" \
  --validate

Auto behavior: Downloads images to output.assets/, generates YAML frontmatter.
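
When the export is driven from another Python program, the same command can be assembled as an argument list. This is a sketch only: `build_single_page_cmd` is a hypothetical helper, and it uses just the flags shown above (`--out`, `--keep-html`, `--tags`, `--validate`).

```python
import sys
from typing import List, Optional, Sequence


def build_single_page_cmd(
    script: str,
    url: str,
    out: str,
    tags: Optional[Sequence[str]] = None,
    keep_html: bool = False,
    validate: bool = False,
) -> List[str]:
    """Build the argv list for a single-page export."""
    cmd = [sys.executable, script, url, "--out", out]
    if keep_html:
        cmd.append("--keep-html")
    if tags:
        # The CLI takes tags as one comma-separated value.
        cmd += ["--tags", ",".join(tags)]
    if validate:
        cmd.append("--validate")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.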

2. WeChat Article Export

python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" \
  --out article.md

Auto behavior: Detects WeChat URL → extracts rich_media_content → cleans interaction buttons.
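
The script's actual detection logic is not shown here. As an illustration of what URL-based auto-detection might look like, the sketch below simply compares the hostname against `mp.weixin.qq.com`; the function name is hypothetical.

```python
from urllib.parse import urlsplit

WECHAT_HOST = "mp.weixin.qq.com"


def looks_like_wechat_article(url: str) -> bool:
    """Return True when the URL's hostname is the WeChat article host."""
    host = (urlsplit(url).hostname or "").lower()
    return host == WECHAT_HOST
```

Matching on the parsed hostname rather than a substring avoids false positives such as `https://example.com/mp.weixin.qq.com`.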

3. Wiki Batch Crawl + Merge

python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl \
  --crawl-pattern 'page=wiki' \
  --merge \
  --toc \
  --merge-output wiki_guide.md \
  --target-id body \
  --clean-wiki-noise \
  --rewrite-links \
  --download-images

Parameters explained:
- `--crawl`: Extract links from index page
- `--crawl-pattern`: Regex to filter content pages
- `--merge --toc`: Combine into single file with TOC
- `--target-id body`: Extract only main content area
- `--clean-wiki-noise`: Remove edit buttons, navigation links
- `--rewrite-links`: Convert external URLs to internal anchors
- `--download-images`: Save images locally
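
To preview which links a `--crawl-pattern` value would keep, the filtering step can be approximated as below. This is a sketch under assumptions: it resolves each href against the index URL and applies `re.search` to the absolute URL, which may differ from what the script actually matches against. `filter_links` is a hypothetical name.

```python
import re
from typing import Iterable, List
from urllib.parse import urljoin


def filter_links(index_url: str, hrefs: Iterable[str], pattern: str) -> List[str]:
    """Resolve hrefs against index_url and keep unique URLs matching the regex."""
    rx = re.compile(pattern)
    seen = set()
    kept: List[str] = []
    for href in hrefs:
        absolute = urljoin(index_url, href)
        if rx.search(absolute) and absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

With the pattern `page=wiki` from the example above, an edit link such as `?cmd=edit&page=other` is dropped while `?page=wiki/Intro` is kept.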

Content Extraction Parameters

| Parameter | Purpose |
|-----------|---------|
| `--target-id ID` | Extract element by id (e.g., `body`, `content`) |
| `--target-class CLASS` | Extract element by class (e.g., `article-body`) |
| `--clean-wiki-noise` | Remove Wiki system noise (PukiWiki/MediaWiki) |
| `--wechat` | Force WeChat article mode |

Batch Processing Parameters

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `--urls-file` | - | Read URLs from file |
| `--max-workers` | 3 | Concurrent threads |
| `--delay` | 1.0 | Request interval (seconds) |
| `--skip-errors` | False | Continue on failures |
| `--download-images` | False | Download images locally |
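
To make the interplay of `--max-workers`, `--delay`, and `--skip-errors` concrete, here is a rough sketch of a throttled thread pool. It is not the script's implementation: `run_batch` and the `fetch` callback are hypothetical, and spacing out submissions by `delay` seconds is an assumption about how the request interval is applied.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_batch(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    max_workers: int = 3,
    delay: float = 1.0,
    skip_errors: bool = False,
) -> Dict[str, object]:
    """Fetch URLs concurrently, spacing out task submission by `delay` seconds."""
    results: Dict[str, object] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            futures[url] = pool.submit(fetch, url)
            time.sleep(delay)  # throttle how quickly new requests start
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                if not skip_errors:
                    raise
                results[url] = exc  # record the failure and keep going
    return results
```

With `skip_errors=True`, failed URLs map to the exception they raised, so a caller can report them after the batch finishes.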

Security Parameters

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `--redact-url` | True | Remove query/fragment from URLs in output (default ON) |
| `--no-redact-url` | - | Keep full URLs including query params |
| `--no-map-json` | False | Skip generating `*.assets.json` mapping file (and remove existing one) |
| `--max-image-bytes` | 25MB | Max size per image (0=unlimited) |
| `--pdf-allow-file-access` | False | Allow file:// access when generating PDF |
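
The effect of `--redact-url` can be illustrated with the standard library. This sketch shows the described behavior (dropping the query string and fragment), not the script's own code; `redact_url` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For example, `https://example.com/a/b?token=secret#frag` becomes `https://example.com/a/b`, which keeps session tokens and tracking parameters out of the saved Markdown.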

Security features (always active):
- Cross-origin image downloads use clean session (no Cookie/Auth leak), including redirect chains
- Redirects back to same host switch back to credentialed session when needed
- Clean session inherits proxy/cert/adapters from the base session (still no sensitive headers)
- HTML attributes sanitized (removes on* events, javascript: URLs)
- Streaming download prevents OOM on large images
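
Two of the features above, the same-host check and the size-capped streaming download, can be sketched as follows. Both helpers are hypothetical illustrations, and the way the script decides "same host" internally is an assumption.

```python
from typing import BinaryIO, Iterable
from urllib.parse import urlsplit


def is_cross_origin(page_url: str, image_url: str) -> bool:
    """True when the image lives on a different host than the page."""
    return urlsplit(page_url).hostname != urlsplit(image_url).hostname


def copy_capped(chunks: Iterable[bytes], dest: BinaryIO, max_bytes: int) -> int:
    """Write chunks to dest, aborting once the total exceeds max_bytes (0 = unlimited)."""
    written = 0
    for chunk in chunks:
        written += len(chunk)
        if max_bytes and written > max_bytes:
            raise ValueError(f"image exceeds {max_bytes} bytes")
        dest.write(chunk)
    return written
```

With `requests`, the chunks would typically come from `response.iter_content(chunk_size=...)` on a request made with `stream=True`, so an oversized image is rejected before it is ever fully held in memory.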

Anti-Scraping Support

# With cookies
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --cookie "session=xxx"

# With custom headers
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --header "Authorization: Bearer xxx"

# Change User-Agent
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --ua-preset firefox-win

Output Structure

output.md                 # Markdown file
output.assets/            # Images directory
  ├── 01-hero.png
  └── 02-diagram.jpg
output.md.assets.json     # URL→local mapping
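
For scripts that post-process the export, the companion paths can be derived from the `--out` value. The naming below is inferred from the tree above and is an assumption; `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(out: str) -> Dict[str, Path]:
    """Derive the assets directory and mapping file names from the Markdown path."""
    md = Path(out)
    return {
        "markdown": md,
        "assets_dir": md.with_suffix(".assets"),                 # output.md -> output.assets
        "map_json": md.with_name(md.name + ".assets.json"),      # output.md -> output.md.assets.json
    }
```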

Common Site Configurations

| Site Type | Recommended Parameters |
|-----------|------------------------|
| PukiWiki | `--target-id body --clean-wiki-noise` |
| MediaWiki | `--target-id content --clean-wiki-noise` |
| WordPress | `--target-class entry-content` |
| WeChat | Auto-detected, or `--wechat` |
| Tech Blog | `--keep-html --tags` |
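
If you call the script from your own tooling, the recommendations above can be kept as a small lookup table. This is a convenience sketch; `SITE_PRESETS` and `preset_args` are hypothetical names, and the Tech Blog row is omitted because `--tags` needs a value of your choosing.

```python
from typing import Dict, List

# Flag sets copied from the table above.
SITE_PRESETS: Dict[str, List[str]] = {
    "pukiwiki": ["--target-id", "body", "--clean-wiki-noise"],
    "mediawiki": ["--target-id", "content", "--clean-wiki-noise"],
    "wordpress": ["--target-class", "entry-content"],
    "wechat": ["--wechat"],
}


def preset_args(site_type: str) -> List[str]:
    """Return a copy of the recommended flags for a site type (case-insensitive)."""
    return list(SITE_PRESETS[site_type.lower()])
```

The returned list can be appended to a base command such as `[python, script, url, "--out", "page.md"]`.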

Dependencies

Required: requests (HTTP requests)
Optional: markdown (for PDF export with --with-pdf)

Install: pip install requests
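
Before running the script, you can check whether the required and optional modules are importable. A minimal sketch using only the standard library; `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules(["requests", "markdown"])` returns an empty list when both are installed; a missing `markdown` only matters if you plan to use `--with-pdf`.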

References

For complete documentation, see references/full-guide.md:
- All parameter explanations with defaults
- 9 usage scenarios with examples
- 3 detailed real-world cases
- Output structure diagrams
- Technical implementation details
- Changelog history

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf
