Install a specific skill from the multi-skill repository:
npx skills add Mindrally/skills --skill "beautifulsoup-parsing"
# Description
Expert guidance for HTML/XML parsing using BeautifulSoup in Python with best practices for DOM navigation, data extraction, and efficient scraping workflows.
# SKILL.md
name: beautifulsoup-parsing
description: Expert guidance for HTML/XML parsing using BeautifulSoup in Python with best practices for DOM navigation, data extraction, and efficient scraping workflows.
BeautifulSoup HTML Parsing
You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.
Core Expertise
- BeautifulSoup API and parsing methods
- CSS selectors and find methods
- DOM traversal and navigation
- HTML/XML parsing with different parsers
- Integration with requests library
- Handling malformed HTML gracefully
- Data extraction patterns and best practices
- Memory-efficient processing
Key Principles
- Write concise, technical code with accurate Python examples
- Prioritize readability, efficiency, and maintainability
- Use modular, reusable functions for common extraction tasks
- Handle missing data gracefully with proper defaults
- Follow PEP 8 style guidelines
- Implement proper error handling for robust scraping
Basic Setup
pip install beautifulsoup4 requests lxml html5lib
Loading HTML
from bs4 import BeautifulSoup
import requests
# From string
html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'lxml')
# From file
with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
# From URL
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'lxml')
Parser Options
# lxml - Fast, lenient (recommended)
soup = BeautifulSoup(html, 'lxml')
# html.parser - Built-in, no dependencies
soup = BeautifulSoup(html, 'html.parser')
# html5lib - Most lenient, slowest
soup = BeautifulSoup(html, 'html5lib')
# lxml-xml - For XML documents
soup = BeautifulSoup(xml, 'lxml-xml')
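The parser also determines how broken markup gets repaired. A quick illustrative comparison (the output in the comments is typical behavior, but may vary slightly across parser versions):
from bs4 import BeautifulSoup
fragment = '<p>Unclosed paragraph'
print(BeautifulSoup(fragment, 'html.parser'))  # keeps just the fragment: <p>Unclosed paragraph</p>
print(BeautifulSoup(fragment, 'lxml'))         # wraps it in <html><body>...</body></html>
print(BeautifulSoup(fragment, 'html5lib'))     # adds <head> as well, building a full document like a browser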
Finding Elements
By Tag
# First matching element
soup.find('h1')
# All matching elements
soup.find_all('p')
# Shorthand
soup.h1 # Same as soup.find('h1')
By Attributes
# By class
soup.find('div', class_='article')
soup.find_all('div', class_='article')
# By ID
soup.find(id='main-content')
# By any attribute
soup.find('a', href='https://example.com')
soup.find_all('input', attrs={'type': 'text', 'name': 'email'})
# By data attributes
soup.find('div', attrs={'data-id': '123'})
CSS Selectors
# Single element
soup.select_one('div.article > h2')
# Multiple elements
soup.select('div.article h2')
# Complex selectors
soup.select('a[href^="https://"]') # Starts with
soup.select('a[href$=".pdf"]') # Ends with
soup.select('a[href*="example"]') # Contains
soup.select('li:nth-child(2)')
soup.select('h1, h2, h3') # Multiple
With Functions
import re
# By regex
soup.find_all('a', href=re.compile(r'^https://'))
# By function
def has_data_attr(tag):
    return tag.has_attr('data-id')
soup.find_all(has_data_attr)
# String matching
soup.find_all(string='exact text')
soup.find_all(string=re.compile('pattern'))
Extracting Data
Text Content
# Get text
element.text
element.get_text()
# Get text with separator
element.get_text(separator=' ')
# Get stripped text
element.get_text(strip=True)
# Get strings (generator)
for string in element.stripped_strings:
    print(string)
Attributes
# Get attribute
element['href']
element.get('href') # Returns None if missing
element.get('href', 'default') # With default
# Get all attributes
element.attrs # Returns dict
# Check attribute exists
element.has_attr('class')
HTML Content
# Outer HTML (the tag plus its contents)
str(element)
# Inner HTML (contents only)
element.decode_contents()
# Just the tag
element.name
# Prettified HTML
element.prettify()
DOM Navigation
Parent/Ancestors
element.parent
element.parents # Generator of all ancestors
# Find specific ancestor
for parent in element.parents:
    if parent.name == 'div' and 'article' in parent.get('class', []):
        break
Children
element.children # Direct children (generator)
list(element.children)
element.contents # Direct children (list)
element.descendants # All descendants (generator)
# Find in children
element.find('span') # Searches descendants
Siblings
element.next_sibling
element.previous_sibling
element.next_siblings # Generator
element.previous_siblings # Generator
# Next/previous parsed element in document order (may be a NavigableString)
element.next_element
element.previous_element
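Sibling traversal is handy for collecting the content that sits between two headings. A minimal sketch, assuming soup is a parsed page whose sections are separated by h2 tags (the heading text is illustrative):
def section_paragraphs(heading):
    """Collect <p> siblings that follow a heading, stopping at the next <h2>."""
    paragraphs = []
    for sibling in heading.find_next_siblings():
        if sibling.name == 'h2':
            break
        if sibling.name == 'p':
            paragraphs.append(sibling.get_text(strip=True))
    return paragraphs

intro = section_paragraphs(soup.find('h2', string='Introduction'))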
Data Extraction Patterns
Safe Extraction
def safe_text(element, selector, default=''):
    """Safely extract text from element."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):
    """Safely extract attribute from element."""
    found = element.select_one(selector)
    return found.get(attr, default) if found else default
Table Extraction
def extract_table(table):
"""Extract table data as list of dictionaries."""
headers = [th.get_text(strip=True) for th in table.select('th')]
rows = []
for tr in table.select('tbody tr'):
cells = [td.get_text(strip=True) for td in tr.select('td')]
if cells:
rows.append(dict(zip(headers, cells)))
return rows
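For example, assuming the page contains a single results table (the selector is illustrative):
table = soup.select_one('table.results')
if table:
    rows = extract_table(table)  # list of {header: cell} dicts, one per row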
List Extraction
def extract_items(soup, selector, extractor):
    """Extract multiple items using a custom extractor function."""
    return [extractor(item) for item in soup.select(selector)]

# Usage
def extract_product(item):
    return {
        'name': safe_text(item, '.name'),
        'price': safe_text(item, '.price'),
        'url': safe_attr(item, 'a', 'href')
    }

products = extract_items(soup, '.product', extract_product)
URL Resolution
from urllib.parse import urljoin
def resolve_url(base_url, relative_url):
    """Convert relative URL to absolute."""
    if not relative_url:
        return None
    return urljoin(base_url, relative_url)
# Usage
base_url = 'https://example.com/products/'
for link in soup.select('a'):
    href = link.get('href')
    absolute_url = resolve_url(base_url, href)
    print(absolute_url)
Handling Malformed HTML
# lxml parser is lenient with malformed HTML
soup = BeautifulSoup(malformed_html, 'lxml')
# For very broken HTML, use html5lib
soup = BeautifulSoup(very_broken_html, 'html5lib')
# Handle encoding issues
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')
Complete Scraping Example
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
class ProductScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
        })

    def fetch_page(self, url):
        """Fetch and parse a page."""
        response = self.session.get(url, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):
        """Extract product data from a card element."""
        return {
            'name': self._safe_text(item, '.product-title'),
            'price': self._parse_price(item.select_one('.price')),
            'rating': self._safe_attr(item, '.rating', 'data-rating'),
            'image': self._resolve(self._safe_attr(item, 'img', 'src')),
            'url': self._resolve(self._safe_attr(item, 'a', 'href')),
            'in_stock': item.select_one('.out-of-stock') is None
        }

    def scrape_products(self, url):
        """Scrape all products from a page."""
        soup = self.fetch_page(url)
        items = soup.select('.product-card')
        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):
        found = element.select_one(selector)
        return found.get(attr, default) if found else default

    def _parse_price(self, element):
        if not element:
            return None
        text = element.get_text(strip=True)
        try:
            return float(text.replace('$', '').replace(',', ''))
        except ValueError:
            return None

    def _resolve(self, url):
        return urljoin(self.base_url, url) if url else None
# Usage
scraper = ProductScraper('https://example.com')
products = scraper.scrape_products('https://example.com/products')
for product in products:
    print(product)
Performance Optimization
# Use SoupStrainer to parse only needed elements
from bs4 import SoupStrainer
only_articles = SoupStrainer('article')
soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)
# Use lxml parser for speed
soup = BeautifulSoup(html, 'lxml') # Fastest
# Decompose unneeded elements
for script in soup.find_all('script'):
    script.decompose()

# Use generators for memory efficiency
def iter_items(soup):
    # extract_data is a placeholder for your own per-item extractor
    for item in soup.select('.item'):
        yield extract_data(item)
Key Dependencies
- beautifulsoup4
- lxml (fast parser)
- html5lib (lenient parser)
- requests
- pandas (for data output; see the sketch after this list)
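Since pandas is listed for data output, here is a minimal sketch of handing extracted rows to a DataFrame (the output filename is just a placeholder):
import pandas as pd

df = pd.DataFrame(products)  # e.g. the list of dicts produced by extract_items() above
df.to_csv('products.csv', index=False)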
Best Practices
- Always use the lxml parser for best performance
- Handle missing elements with default values
- Use select() and select_one() for CSS selectors
- Use get_text(strip=True) for clean text extraction
- Resolve relative URLs to absolute
- Validate extracted data types
- Implement rate limiting between requests (see the sketch after this list)
- Use proper User-Agent headers
- Handle character encoding properly
- Use SoupStrainer for large documents
- Follow robots.txt and website terms of service
- Implement retry logic for failed requests (see the sketch after this list)
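A minimal sketch of the rate-limiting and retry points above, using only requests and the standard library (the delay and retry counts are arbitrary placeholders):
import time
import requests
from bs4 import BeautifulSoup

def polite_get(session, url, retries=3, delay=1.0):
    """GET a URL with simple retries and a growing pause between attempts."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # back off a little more each time

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'})
for url in ['https://example.com/page/1', 'https://example.com/page/2']:
    soup = BeautifulSoup(polite_get(session, url).content, 'lxml')
    time.sleep(1.0)  # rate limit between requests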
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.