wangminle

webpage-to-md

# Install this skill:
npx skills add wangminle/skills-webpage-to-md --skill "webpage-to-md"

Install specific skill from multi-skill repository

# Description

Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format.

# SKILL.md


---
name: webpage-to-md
description: "Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format."
---


Web to Markdown Grabber

Extract web content and convert to clean Markdown with local images.

Script Location

This skill includes a Python script at scripts/grab_web_to_md.py.

When using this skill, replace SKILL_DIR with the actual skill installation path:
- Claude Code: ~/.claude/skills/webpage-to-md/
- Cursor: ~/.cursor/skills/webpage-to-md/ (if installed there)
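
If it is unclear which location holds the skill, a small helper can probe the two paths listed above. This is a minimal sketch: the function name `find_grab_script` and the `DEFAULT_SKILL_DIRS` tuple are hypothetical and not part of the skill.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate install locations, taken from the list above.
DEFAULT_SKILL_DIRS = (
    Path.home() / ".claude" / "skills" / "webpage-to-md",
    Path.home() / ".cursor" / "skills" / "webpage-to-md",
)


def find_grab_script(candidates: Iterable[Path] = DEFAULT_SKILL_DIRS) -> Optional[Path]:
    """Return the first existing scripts/grab_web_to_md.py, or None if none is found."""
    for skill_dir in candidates:
        script = Path(skill_dir) / "scripts" / "grab_web_to_md.py"
        if script.is_file():
            return script
    return None
```

The parent of the returned path's `scripts` directory is the value to substitute for SKILL_DIR.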

Quick Start

# Single page export
python SKILL_DIR/scripts/grab_web_to_md.py "https://example.com/article" --out output.md --validate

# WeChat article (auto-detected)
python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" --out article.md

# Wiki batch crawl + merge
python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl --crawl-pattern 'page=' \
  --merge --toc --merge-output wiki.md

Core Parameters

| Parameter | Purpose | Example |
| --- | --- | --- |
| --out | Output file path | --out docs/article.md |
| --validate | Verify image integrity | --validate |
| --keep-html | Preserve complex tables | --keep-html |
| --tags | Add YAML frontmatter tags | --tags "ai,tutorial" |

Three Main Use Cases

1. Single Page Export (Blog/News)

python SKILL_DIR/scripts/grab_web_to_md.py "URL" \
  --out output.md \
  --keep-html \
  --tags "topic1,topic2" \
  --validate

Auto behavior: Downloads images to output.assets/, generates YAML frontmatter.
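
When the export is driven from another Python program, the same command line can be assembled as an argument list. The sketch below only uses flags shown above; the helper name `build_single_page_command` is hypothetical, and it assumes the script is launched with the current interpreter.

```python
import sys
from typing import List, Optional, Sequence


def build_single_page_command(
    script_path: str,
    url: str,
    out: str,
    tags: Optional[Sequence[str]] = None,
    keep_html: bool = False,
    validate: bool = False,
) -> List[str]:
    """Assemble the argv list for a single-page export."""
    cmd = [sys.executable, script_path, url, "--out", out]
    if keep_html:
        cmd.append("--keep-html")
    if tags:
        # --tags takes one comma-separated value, as in the example above.
        cmd += ["--tags", ",".join(tags)]
    if validate:
        cmd.append("--validate")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.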

2. WeChat Article Export

python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" \
  --out article.md

Auto behavior: Detects WeChat URL → extracts rich_media_content → cleans interaction buttons.

3. Wiki Batch Crawl + Merge

python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl \
  --crawl-pattern 'page=wiki' \
  --merge \
  --toc \
  --merge-output wiki_guide.md \
  --target-id body \
  --clean-wiki-noise \
  --rewrite-links \
  --download-images

Parameters explained:
- --crawl: Extract links from index page
- --crawl-pattern: Regex to filter content pages
- --merge --toc: Combine into single file with TOC
- --target-id body: Extract only main content area
- --clean-wiki-noise: Remove edit buttons, navigation links
- --rewrite-links: Convert external URLs to internal anchors
- --download-images: Save images locally
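
To preview which pages a given --crawl-pattern would likely select, the filtering step can be approximated with the standard library. This is a sketch of the general technique, not the script's actual implementation: it assumes the pattern is applied with `re.search` to absolute URLs, and the names `filter_links` and `_LinkCollector` are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def filter_links(index_html: str, base_url: str, pattern: str) -> List[str]:
    """Return absolute, de-duplicated links whose URL matches the regex."""
    collector = _LinkCollector()
    collector.feed(index_html)
    regex = re.compile(pattern)
    seen = set()
    result = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if regex.search(absolute) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Running this against a saved copy of the index page is a cheap way to tune the pattern before starting a full crawl.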

Content Extraction Parameters

| Parameter | Purpose |
| --- | --- |
| --target-id ID | Extract element by id (e.g., body, content) |
| --target-class CLASS | Extract element by class (e.g., article-body) |
| --clean-wiki-noise | Remove Wiki system noise (PukiWiki/MediaWiki) |
| --wechat | Force WeChat article mode |

Batch Processing Parameters

| Parameter | Default | Purpose |
| --- | --- | --- |
| --urls-file | - | Read URLs from file |
| --max-workers | 3 | Concurrent threads |
| --delay | 1.0 | Request interval (seconds) |
| --skip-errors | False | Continue on failures |
| --download-images | False | Download images locally |

Security Parameters

| Parameter | Default | Purpose |
| --- | --- | --- |
| --redact-url | True | Remove query/fragment from URLs in output (default ON) |
| --no-redact-url | - | Keep full URLs including query params |
| --no-map-json | False | Skip generating *.assets.json mapping file (and remove existing one) |
| --max-image-bytes | 25MB | Max size per image (0=unlimited) |
| --pdf-allow-file-access | False | Allow file:// access when generating PDF |
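
The effect of --redact-url can be illustrated with `urllib.parse`. This is a minimal sketch of stripping the query and fragment, assuming nothing else in the URL is altered; the function name `redact_url` is hypothetical and the script's own logic may differ.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For example, `redact_url("https://example.com/a/b?token=secret#frag")` yields `https://example.com/a/b`, so session tokens carried in query parameters do not end up in the exported Markdown.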

Security features (always active):
- Cross-origin image downloads use clean session (no Cookie/Auth leak), including redirect chains
- Redirects back to same host switch back to credentialed session when needed
- Clean session inherits proxy/cert/adapters from the base session (still no sensitive headers)
- HTML attributes sanitized (removes on* events, javascript: URLs)
- Streaming download prevents OOM on large images
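
The attribute sanitization described above can be pictured as a filter over each element's attributes. The sketch below is illustrative only: `sanitize_attrs` is a hypothetical name, and a production sanitizer would also need to handle obfuscated schemes and other edge cases not covered here.

```python
from typing import Dict


def sanitize_attrs(attrs: Dict[str, str]) -> Dict[str, str]:
    """Drop on* event handlers and attributes whose value is a javascript: URL."""
    clean = {}
    for name, value in attrs.items():
        if name.lower().startswith("on"):
            continue  # e.g. onclick, onload
        if value.strip().lower().startswith("javascript:"):
            continue  # e.g. href="javascript:..."
        clean[name] = value
    return clean
```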

Anti-Scraping Support

# With cookies
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --cookie "session=xxx"

# With custom headers
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --header "Authorization: Bearer xxx"

# Change User-Agent
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --ua-preset firefox-win

Output Structure

output.md                 # Markdown file
output.assets/            # Images directory
  ├── 01-hero.png
  └── 02-diagram.jpg
output.md.assets.json     # URL→local mapping
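
The naming in the layout above can be derived from the --out path, which is handy for post-processing or cleanup. This sketch infers the convention from the example tree only (`output.md` gives `output.assets/` and `output.md.assets.json`); the helper name `expected_outputs` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(out: str) -> Dict[str, Path]:
    """Derive the asset directory and mapping file paths from the --out path."""
    md = Path(out)
    return {
        "markdown": md,
        # output.md -> output.assets
        "assets_dir": md.with_suffix(".assets"),
        # output.md -> output.md.assets.json
        "map_json": md.with_name(md.name + ".assets.json"),
    }
```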

Common Site Configurations

| Site Type | Recommended Parameters |
| --- | --- |
| PukiWiki | --target-id body --clean-wiki-noise |
| MediaWiki | --target-id content --clean-wiki-noise |
| WordPress | --target-class entry-content |
| WeChat | Auto-detected, or --wechat |
| Tech Blog | --keep-html --tags |
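
For scripted use, the recommendations above can be kept as a lookup table. This is a sketch: the `SITE_PRESETS` mapping and `flags_for` helper are hypothetical names, and the Tech Blog row is omitted because --tags needs a caller-supplied value.

```python
from typing import Dict, List

# Flags copied from the site configuration table above.
SITE_PRESETS: Dict[str, List[str]] = {
    "pukiwiki": ["--target-id", "body", "--clean-wiki-noise"],
    "mediawiki": ["--target-id", "content", "--clean-wiki-noise"],
    "wordpress": ["--target-class", "entry-content"],
    "wechat": ["--wechat"],
}


def flags_for(site_type: str) -> List[str]:
    """Return a copy of the recommended flags, or an empty list if unknown."""
    return list(SITE_PRESETS.get(site_type.lower(), []))
```

The returned list can be appended to the argument list of any of the commands shown earlier.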

Dependencies

- Required: requests (HTTP requests)
- Optional: markdown (for PDF export with --with-pdf)

Install: pip install requests
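
Before invoking the script, the dependencies listed above can be checked without importing them. A minimal sketch using the standard library; the helper name `missing_modules` is hypothetical.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]
```

Calling `missing_modules(["requests"])` covers the required dependency, and `missing_modules(["markdown"])` covers the optional one needed for --with-pdf.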

References

For complete documentation, see references/full-guide.md:
- All parameter explanations with defaults
- 9 usage scenarios with examples
- 3 detailed real-world cases
- Output structure diagrams
- Technical implementation details
- Changelog history
