```shell
npx skills add wangminle/skills-webpage-to-md --skill "webpage-to-md"
```

Installs this specific skill from the multi-skill repository.
# Description
Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format.
# SKILL.md

```yaml
name: webpage-to-md
description: "Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format."
```
# Web to Markdown Grabber
Extract web content and convert to clean Markdown with local images.
## Script Location

This skill includes a Python script at `scripts/grab_web_to_md.py`.
When using this skill, replace `SKILL_DIR` with the actual skill installation path:
- Claude Code: `~/.claude/skills/webpage-to-md/`
- Cursor: `~/.cursor/skills/webpage-to-md/` (if installed there)
## Quick Start

```shell
# Single page export
python SKILL_DIR/scripts/grab_web_to_md.py "https://example.com/article" --out output.md --validate

# WeChat article (auto-detected)
python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" --out article.md

# Wiki batch crawl + merge
python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl --crawl-pattern 'page=' \
  --merge --toc --merge-output wiki.md
```
## Core Parameters

| Parameter | Purpose | Example |
|---|---|---|
| `--out` | Output file path | `--out docs/article.md` |
| `--validate` | Verify image integrity | `--validate` |
| `--keep-html` | Preserve complex tables | `--keep-html` |
| `--tags` | Add YAML frontmatter tags | `--tags "ai,tutorial"` |
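As a rough illustration of what `--tags` contributes, a minimal sketch of building YAML frontmatter from a comma-separated tag string (the actual field names the script emits are assumptions, not documented here):

```python
def build_frontmatter(title: str, source_url: str, tags_csv: str) -> str:
    """Assemble a YAML frontmatter block; field names are illustrative."""
    tags = [t.strip() for t in tags_csv.split(",") if t.strip()]
    lines = [
        "---",
        f"title: {title}",
        f"source: {source_url}",
        "tags: [" + ", ".join(tags) + "]",
        "---",
    ]
    return "\n".join(lines)

print(build_frontmatter("Example", "https://example.com/article", "ai, tutorial"))
```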
## Three Main Use Cases

### 1. Single Page Export (Blog/News)

```shell
python SKILL_DIR/scripts/grab_web_to_md.py "URL" \
  --out output.md \
  --keep-html \
  --tags "topic1,topic2" \
  --validate
```

Auto behavior: downloads images to `output.assets/` and generates YAML frontmatter.
### 2. WeChat Article Export

```shell
python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" \
  --out article.md
```

Auto behavior: detects the WeChat URL → extracts `rich_media_content` → removes interaction buttons.
### 3. Wiki Batch Crawl + Merge

```shell
python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl \
  --crawl-pattern 'page=wiki' \
  --merge \
  --toc \
  --merge-output wiki_guide.md \
  --target-id body \
  --clean-wiki-noise \
  --rewrite-links \
  --download-images
```

Parameters explained:
- `--crawl`: Extract links from the index page
- `--crawl-pattern`: Regex to filter content pages
- `--merge --toc`: Combine into a single file with a table of contents
- `--target-id body`: Extract only the main content area
- `--clean-wiki-noise`: Remove edit buttons and navigation links
- `--rewrite-links`: Convert external URLs to internal anchors
- `--download-images`: Save images locally
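The crawl step's filtering can be pictured as follows: collect candidate hrefs from the index page, then keep only those matching the `--crawl-pattern` regex. This is a sketch of the assumed behavior, not the script's actual implementation:

```python
import re

def filter_links(hrefs, pattern):
    """Keep hrefs matching the pattern, deduplicated in original order."""
    rx = re.compile(pattern)
    seen, out = set(), []
    for href in hrefs:
        if rx.search(href) and href not in seen:
            seen.add(href)
            out.append(href)
    return out

links = [
    "https://wiki.example.com/?page=wiki/Intro",
    "https://wiki.example.com/?cmd=edit",          # filtered out: not a content page
    "https://wiki.example.com/?page=wiki/Setup",
]
print(filter_links(links, r"page=wiki"))  # keeps only the two content pages
```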
## Content Extraction Parameters

| Parameter | Purpose |
|---|---|
| `--target-id ID` | Extract element by id (e.g., `body`, `content`) |
| `--target-class CLASS` | Extract element by class (e.g., `article-body`) |
| `--clean-wiki-noise` | Remove Wiki system noise (PukiWiki/MediaWiki) |
| `--wechat` | Force WeChat article mode |
## Batch Processing Parameters

| Parameter | Default | Purpose |
|---|---|---|
| `--urls-file` | - | Read URLs from a file |
| `--max-workers` | 3 | Concurrent threads |
| `--delay` | 1.0 | Request interval (seconds) |
| `--skip-errors` | False | Continue on failures |
| `--download-images` | False | Download images locally |
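How `--max-workers`, `--delay`, and `--skip-errors` plausibly interact can be sketched with a thread pool and paced submissions; `fetch()` is a stand-in for the real download, and this is an assumed model, not the script's code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for the real page download
    return f"fetched {url}"

def batch_fetch(urls, max_workers=3, delay=1.0, skip_errors=False):
    """Fetch URLs concurrently, pacing submissions by `delay` seconds."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for url in urls:
            futures[url] = pool.submit(fetch, url)
            time.sleep(delay)            # request interval (--delay)
        for url, fut in futures.items():
            try:
                results[url] = fut.result()
            except Exception:
                if not skip_errors:      # --skip-errors continues past failures
                    raise
    return results
```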
## Security Parameters

| Parameter | Default | Purpose |
|---|---|---|
| `--redact-url` | True | Remove query/fragment from URLs in output (default ON) |
| `--no-redact-url` | - | Keep full URLs, including query params |
| `--no-map-json` | False | Skip generating the `*.assets.json` mapping file (and remove an existing one) |
| `--max-image-bytes` | 25MB | Max size per image (0 = unlimited) |
| `--pdf-allow-file-access` | False | Allow `file://` access when generating a PDF |
Security features (always active):
- Cross-origin image downloads use a clean session (no Cookie/Authorization leakage), including across redirect chains
- Redirects back to the same host switch back to the credentialed session when needed
- The clean session inherits proxy/cert/adapter settings from the base session (still without sensitive headers)
- HTML attributes are sanitized (removes `on*` event handlers and `javascript:` URLs)
- Streaming downloads prevent OOM on large images
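The cross-origin rule above can be sketched as a header filter: sensitive headers are dropped whenever the image host differs from the page host. Function names here are illustrative, not the script's actual API:

```python
from urllib.parse import urlparse

SENSITIVE = {"cookie", "authorization"}

def headers_for(page_url, image_url, headers):
    """Return request headers, stripping credentials for cross-origin hosts."""
    same_host = urlparse(page_url).netloc == urlparse(image_url).netloc
    if same_host:
        return dict(headers)  # credentialed session for same-host requests
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE}

h = {"Cookie": "session=xxx", "User-Agent": "demo"}
print(headers_for("https://a.com/p", "https://cdn.b.com/i.png", h))
# Cookie is stripped for the cross-origin request
```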
## Anti-Scraping Support

```shell
# With cookies
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --cookie "session=xxx"

# With custom headers
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --header "Authorization: Bearer xxx"

# Change the User-Agent
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --ua-preset firefox-win
```
## Output Structure

```
output.md              # Markdown file
output.assets/         # Images directory
├── 01-hero.png
└── 02-diagram.jpg
output.md.assets.json  # URL→local mapping
```
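The mapping file lets tooling relate downloaded assets back to their sources. The exact JSON schema is an assumption here (a flat `{remote_url: local_path}` object); a minimal sketch of consuming it:

```python
import json

# Assumed schema: flat object mapping remote URL to local asset path
mapping_json = '''{
  "https://example.com/hero.png": "output.assets/01-hero.png",
  "https://example.com/diagram.jpg": "output.assets/02-diagram.jpg"
}'''

mapping = json.loads(mapping_json)
for remote, local in mapping.items():
    print(f"{remote} -> {local}")
```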
## Common Site Configurations

| Site Type | Recommended Parameters |
|---|---|
| PukiWiki | `--target-id body --clean-wiki-noise` |
| MediaWiki | `--target-id content --clean-wiki-noise` |
| WordPress | `--target-class entry-content` |
| WeChat | Auto-detected, or `--wechat` |
| Tech Blog | `--keep-html --tags` |
## Dependencies

- Required: `requests` (HTTP requests)
- Optional: `markdown` (for PDF export with `--with-pdf`)

Install: `pip install requests`
## References

For complete documentation, see `references/full-guide.md`:
- All parameter explanations with defaults
- 9 usage scenarios with examples
- 3 detailed real-world cases
- Output structure diagrams
- Technical implementation details
- Changelog history
# Supported AI Coding Agents

This skill follows the SKILL.md standard and works with all major AI coding agents. Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.