mindrally

beautifulsoup-parsing

# Install this skill:
npx skills add Mindrally/skills --skill "beautifulsoup-parsing"

This command installs just this skill from the multi-skill repository.

# Description

Expert guidance for HTML/XML parsing using BeautifulSoup in Python with best practices for DOM navigation, data extraction, and efficient scraping workflows.

# SKILL.md


---
name: beautifulsoup-parsing
description: Expert guidance for HTML/XML parsing using BeautifulSoup in Python with best practices for DOM navigation, data extraction, and efficient scraping workflows.
---

BeautifulSoup HTML Parsing

You are an expert in BeautifulSoup, Python HTML/XML parsing, DOM navigation, and building efficient data extraction pipelines for web scraping.

Core Expertise

  • BeautifulSoup API and parsing methods
  • CSS selectors and find methods
  • DOM traversal and navigation
  • HTML/XML parsing with different parsers
  • Integration with requests library
  • Handling malformed HTML gracefully
  • Data extraction patterns and best practices
  • Memory-efficient processing

Key Principles

  • Write concise, technical code with accurate Python examples
  • Prioritize readability, efficiency, and maintainability
  • Use modular, reusable functions for common extraction tasks
  • Handle missing data gracefully with proper defaults
  • Follow PEP 8 style guidelines
  • Implement proper error handling for robust scraping (see the sketch after this list)
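
Tying several of these principles together, here is a minimal sketch of a small, reusable extraction helper with error handling and a safe default (extract_title, the h1 selector, and the default value are illustrative assumptions, not part of BeautifulSoup):

import requests
from bs4 import BeautifulSoup

def extract_title(url, default='(no title)'):
    """Fetch a page and return its main heading, falling back to a default."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return default  # network or HTTP error: return the safe default
    soup = BeautifulSoup(response.content, 'lxml')
    heading = soup.select_one('h1')
    return heading.get_text(strip=True) if heading else default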

Basic Setup

pip install beautifulsoup4 requests lxml

Loading HTML

from bs4 import BeautifulSoup
import requests

# From string
html = '<html><body><h1>Hello</h1></body></html>'
soup = BeautifulSoup(html, 'lxml')

# From file
with open('page.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

# From URL
response = requests.get('https://example.com')
soup = BeautifulSoup(response.content, 'lxml')

Parser Options

# lxml - Fast, lenient (recommended)
soup = BeautifulSoup(html, 'lxml')

# html.parser - Built-in, no dependencies
soup = BeautifulSoup(html, 'html.parser')

# html5lib - Most lenient, slowest
soup = BeautifulSoup(html, 'html5lib')

# lxml-xml - For XML documents
soup = BeautifulSoup(xml, 'lxml-xml')

Finding Elements

By Tag

# First matching element
soup.find('h1')

# All matching elements
soup.find_all('p')

# Shorthand
soup.h1  # Same as soup.find('h1')

By Attributes

# By class
soup.find('div', class_='article')
soup.find_all('div', class_='article')

# By ID
soup.find(id='main-content')

# By any attribute
soup.find('a', href='https://example.com')
soup.find_all('input', attrs={'type': 'text', 'name': 'email'})

# By data attributes
soup.find('div', attrs={'data-id': '123'})

CSS Selectors

# Single element
soup.select_one('div.article > h2')

# Multiple elements
soup.select('div.article h2')

# Complex selectors
soup.select('a[href^="https://"]')  # Starts with
soup.select('a[href$=".pdf"]')      # Ends with
soup.select('a[href*="example"]')   # Contains
soup.select('li:nth-child(2)')
soup.select('h1, h2, h3')           # Multiple

With Functions

import re

# By regex
soup.find_all('a', href=re.compile(r'^https://'))

# By function
def has_data_attr(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attr)

# String matching
soup.find_all(string='exact text')
soup.find_all(string=re.compile('pattern'))

Extracting Data

Text Content

# Get text
element.text
element.get_text()

# Get text with separator
element.get_text(separator=' ')

# Get stripped text
element.get_text(strip=True)

# Get strings (generator)
for string in element.stripped_strings:
    print(string)

Attributes

# Get attribute
element['href']
element.get('href')  # Returns None if missing
element.get('href', 'default')  # With default

# Get all attributes
element.attrs  # Returns dict

# Check attribute exists
element.has_attr('class')

HTML Content

# Outer HTML (the tag plus its contents)
str(element)

# Inner HTML (contents only)
element.decode_contents()

# Tag name
element.name

# Prettified HTML
element.prettify()

DOM Navigation

Parent/Ancestors

element.parent
element.parents  # Generator of all ancestors

# Find specific ancestor
for parent in element.parents:
    if parent.name == 'div' and 'article' in parent.get('class', []):
        break

Children

element.children      # Direct children (generator)
list(element.children)

element.contents      # Direct children (list)
element.descendants   # All descendants (generator)

# Find in children
element.find('span')  # Searches descendants

Siblings

element.next_sibling       # May be a whitespace text node, not a tag
element.previous_sibling

element.next_siblings      # Generator
element.previous_siblings  # Generator

# Next/previous parsed node in document order (does not skip text nodes)
element.next_element
element.previous_element

Data Extraction Patterns

Safe Extraction

def safe_text(element, selector, default=''):
    """Safely extract text from element."""
    found = element.select_one(selector)
    return found.get_text(strip=True) if found else default

def safe_attr(element, selector, attr, default=None):
    """Safely extract attribute from element."""
    found = element.select_one(selector)
    return found.get(attr, default) if found else default

Table Extraction

def extract_table(table):
    """Extract table data as list of dictionaries."""
    headers = [th.get_text(strip=True) for th in table.select('th')]

    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        if cells:
            rows.append(dict(zip(headers, cells)))

    return rows

List Extraction

def extract_items(soup, selector, extractor):
    """Extract multiple items using a custom extractor function."""
    return [extractor(item) for item in soup.select(selector)]

# Usage
def extract_product(item):
    return {
        'name': safe_text(item, '.name'),
        'price': safe_text(item, '.price'),
        'url': safe_attr(item, 'a', 'href')
    }

products = extract_items(soup, '.product', extract_product)

URL Resolution

from urllib.parse import urljoin

def resolve_url(base_url, relative_url):
    """Convert relative URL to absolute."""
    if not relative_url:
        return None
    return urljoin(base_url, relative_url)

# Usage
base_url = 'https://example.com/products/'
for link in soup.select('a'):
    href = link.get('href')
    absolute_url = resolve_url(base_url, href)
    print(absolute_url)

Handling Malformed HTML

# lxml parser is lenient with malformed HTML
soup = BeautifulSoup(malformed_html, 'lxml')

# For very broken HTML, use html5lib
soup = BeautifulSoup(very_broken_html, 'html5lib')

# Handle encoding issues
response = requests.get(url)
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, 'lxml')

Complete Scraping Example

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

class ProductScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'
        })

    def fetch_page(self, url):
        """Fetch and parse a page."""
        response = self.session.get(url, timeout=30)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'lxml')

    def extract_product(self, item):
        """Extract product data from a card element."""
        return {
            'name': self._safe_text(item, '.product-title'),
            'price': self._parse_price(item.select_one('.price')),
            'rating': self._safe_attr(item, '.rating', 'data-rating'),
            'image': self._resolve(self._safe_attr(item, 'img', 'src')),
            'url': self._resolve(self._safe_attr(item, 'a', 'href')),
            'in_stock': item.select_one('.out-of-stock') is None
        }

    def scrape_products(self, url):
        """Scrape all products from a page."""
        soup = self.fetch_page(url)
        items = soup.select('.product-card')
        return [self.extract_product(item) for item in items]

    def _safe_text(self, element, selector, default=''):
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default

    def _safe_attr(self, element, selector, attr, default=None):
        found = element.select_one(selector)
        return found.get(attr, default) if found else default

    def _parse_price(self, element):
        if not element:
            return None
        text = element.get_text(strip=True)
        try:
            return float(text.replace('$', '').replace(',', ''))
        except ValueError:
            return None

    def _resolve(self, url):
        return urljoin(self.base_url, url) if url else None


# Usage
scraper = ProductScraper('https://example.com')
products = scraper.scrape_products('https://example.com/products')
for product in products:
    print(product)

Performance Optimization

# Use SoupStrainer to parse only needed elements
from bs4 import SoupStrainer

only_articles = SoupStrainer('article')
soup = BeautifulSoup(html, 'lxml', parse_only=only_articles)

# Use lxml parser for speed
soup = BeautifulSoup(html, 'lxml')  # Fastest

# Decompose unneeded elements
for script in soup.find_all('script'):
    script.decompose()

# Use a generator function to stream items for memory efficiency
def iter_items(soup):
    for item in soup.select('.item'):
        yield extract_data(item)  # extract_data: your per-item extractor

Key Dependencies

  • beautifulsoup4
  • lxml (fast parser)
  • html5lib (lenient parser)
  • requests
  • pandas (for tabular data output; see the sketch below)
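
As a sketch of how pandas might fit in for data output (this reuses the extract_table() helper from the Table Extraction section above; the sample HTML and output filename are illustrative):

import pandas as pd
from bs4 import BeautifulSoup

html = '''<table>
  <thead><tr><th>Name</th><th>Price</th></tr></thead>
  <tbody><tr><td>Widget</td><td>9.99</td></tr></tbody>
</table>'''
soup = BeautifulSoup(html, 'lxml')

# extract_table() returns a list of dicts, which pandas accepts directly
df = pd.DataFrame(extract_table(soup.find('table')))
df.to_csv('products.csv', index=False)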

Best Practices

  1. Prefer the lxml parser for speed; fall back to html5lib for badly broken HTML
  2. Handle missing elements with default values
  3. Use select() and select_one() for CSS selectors
  4. Use get_text(strip=True) for clean text extraction
  5. Resolve relative URLs to absolute
  6. Validate extracted data types
  7. Implement rate limiting between requests
  8. Use proper User-Agent headers
  9. Handle character encoding properly
  10. Use SoupStrainer for large documents
  11. Follow robots.txt and website terms of service
  12. Implement retry logic for failed requests (rate limiting and retries are sketched below)
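
A minimal sketch combining rate limiting (practice 7) and retry logic (practice 12); fetch_with_retry, the backoff schedule, and the example URLs are assumptions for illustration:

import time
import requests
from bs4 import BeautifulSoup

def fetch_with_retry(session, url, retries=3, delay=1.0):
    """Fetch and parse a page, retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return BeautifulSoup(response.content, 'lxml')
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of retries: surface the error
            time.sleep(delay * (2 ** attempt))

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/1.0)'})
for url in ['https://example.com/page/1', 'https://example.com/page/2']:
    soup = fetch_with_retry(session, url)
    # ... process soup here ...
    time.sleep(1.0)  # rate limit: pause between requests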

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.