normalize

by @grahama1970 in Tools

# Install this skill:

npx skills add grahama1970/agent-skills --skill "normalize"

Install specific skill from multi-skill repository

# Description

# SKILL.md

name: normalize
description: >
  Normalize text to handle PDF/Unicode encoding issues.
  Converts Windows-1252, curly quotes, em/en dashes, ligatures,
  directional formatting, zero-width chars, and more to clean ASCII.
allowed-tools: Bash, Read
triggers:
  - normalize text
  - clean text
  - normalize unicode
  - fix encoding
  - clean pdf text
  - normalize pdf
metadata:
  short-description: Clean PDF/Unicode text to ASCII
  project-path: /home/graham/workspace/experiments/pi-mono

Text Normalize

Normalizes text extracted from PDFs and other sources that carry Unicode or legacy-encoding artifacts, converting it to clean ASCII.

Quick Start

# Normalize text from stdin (bash's printf expands \u escapes; plain echo does not)
printf 'Hello\u2019world\n' | .pi/skills/normalize/run.sh

# Normalize a file
.pi/skills/normalize/run.sh document.txt

# Normalize with output file
.pi/skills/normalize/run.sh document.txt -o clean.txt

# Treat argument as text, not a filename (bash $'...' quoting expands \u escapes)
.pi/skills/normalize/run.sh -t $'Hello\u201cworld\u201d'

# Show statistics
.pi/skills/normalize/run.sh document.txt --stats

What It Normalizes

| Category | Examples | Normalized To |
|---|---|---|
| Whitespace | Non-breaking, em/en space, hair space | Regular space |
| Hyphens | En dash, em dash, minus sign, figure dash | ASCII hyphen `-` |
| Quotes | Curly quotes, guillemets, primes | Straight `'` and `"` |
| Windows-1252 | `\x93`, `\x94`, `\x92` | `"`, `"`, `'` |
| Ligatures | fi, fl, ffi, ffl | Expanded letters |
| Bullets | Various bullet points | Hyphen `-` |
| Zero-width | ZWSP, ZWNJ, ZWJ, BOM | Removed |
| Directional | LTR/RTL marks | Removed |
| Control chars | C0/C1 (except newline/tab) | Removed |
| Line breaks | `intro-\nduction` | `introduction` |
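
Most rows in the table are one-character substitutions or deletions, which Python's `str.translate` handles in a single pass. The following is a minimal sketch of that idea, not the skill's actual code: `CHAR_MAP` and `apply_char_map` are hypothetical names, and the mapping covers only a few representative characters per category.

```python
# Hypothetical mapping for illustration; the skill's real tables may differ.
CHAR_MAP = {
    "\u00a0": " ",    # no-break space
    "\u2003": " ",    # em space
    "\u200a": " ",    # hair space
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2212": "-",    # minus sign
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\ufb01": "fi",   # fi ligature expands to two letters
    "\ufb02": "fl",   # fl ligature expands to two letters
    "\u2022": "-",    # bullet
    "\u200b": None,   # zero-width space: removed
    "\ufeff": None,   # byte order mark: removed
    "\u200e": None,   # left-to-right mark: removed
}

# str.maketrans accepts single-character keys mapped to strings or None.
TABLE = str.maketrans(CHAR_MAP)


def apply_char_map(text: str) -> str:
    """Apply every substitution and deletion in one pass."""
    return text.translate(TABLE)


print(apply_char_map("\u201cde\ufb01ne\u201d\u00a0\u2013\u200b ok"))
```

The line-break row is different in kind: joining `intro-\nduction` depends on context around the hyphen, so it needs a regular expression rather than a character table.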

Pipeline Integration

This skill is based on the same normalization used in the extractor pipeline's
`s02_marker_extractor.py`. The code is kept in sync with `text_toolz` patterns.

Python Usage

from normalize import normalize_text

# Clean text for pattern matching
text = "1.\u00a0Introduction"  # Non-breaking space
clean = normalize_text(text)   # "1. Introduction"
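
To see why this matters for pattern matching, the sketch below shows a heading regex that fails on the raw string and succeeds once the non-breaking space is folded. It does not import the skill: `stand_in_normalize` is a hypothetical stand-in that applies NFKC only, which is enough for this one character, and `HEADING` is an illustrative pattern.

```python
import re
import unicodedata

# Illustrative pattern: a section number, a dot, one ASCII space, a title word.
HEADING = re.compile(r"^(\d+)\. (\w+)$")


def stand_in_normalize(text: str) -> str:
    # Stand-in, NOT the skill's normalize_text: NFKC alone maps U+00A0 to U+0020.
    return unicodedata.normalize("NFKC", text)


raw = "1.\u00a0Introduction"

# The literal space in the pattern does not match a non-breaking space.
before = HEADING.match(raw)

# After normalization the same pattern matches.
after = HEADING.match(stand_in_normalize(raw))

print(before)          # None
print(after.groups())  # ('1', 'Introduction')
```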

Normalization Steps

1. Windows-1252 conversion - Handle legacy MS Office encoding
2. NFKC normalization - Unicode compatibility decomposition followed by canonical composition
3. Remove directional formatting - LTR/RTL marks
4. Remove control characters - C0/C1 (preserve newlines and tabs)
5. Normalize whitespace - All special spaces to an ASCII space
6. Normalize hyphens - All dash variants to `-`
7. Normalize quotes - Curly to straight
8. Normalize dots - Ellipsis, leader dots
9. Normalize bullets - All bullet types to `-`
10. Expand ligatures - fi/fl/ffi/ffl
11. Fix line-break hyphens - Join words hyphenated across a line break
12. Collapse whitespace - Multiple spaces to a single space
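
The steps above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the skill's implementation: `sketch_normalize` and the helper names are hypothetical, the character sets are abbreviated, and the ligature and ellipsis steps are left to NFKC, which already expands them.

```python
import re
import unicodedata

# Stray C1 code points, typical of Windows-1252 bytes decoded as Latin-1.
_C1 = re.compile(r"[\x80-\x9f]")
# Zero-width characters, LTR/RTL marks, embeddings, isolates, and the BOM.
_INVISIBLE = re.compile(r"[\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff]")
# Abbreviated, illustrative punctuation table (dashes, quotes, bullets).
_PUNCT = str.maketrans({
    **dict.fromkeys("\u2010\u2011\u2012\u2013\u2014\u2015\u2212", "-"),
    **dict.fromkeys("\u2018\u2019\u201a\u201b", "'"),
    **dict.fromkeys("\u201c\u201d\u201e\u201f\u00ab\u00bb", '"'),
    **dict.fromkeys("\u2022\u2023\u25e6\u2043", "-"),
})


def _cp1252(match: re.Match) -> str:
    # Reinterpret the code point as the Windows-1252 byte it came from;
    # the five bytes cp1252 leaves undefined are dropped.
    return bytes([ord(match.group())]).decode("cp1252", errors="ignore")


def sketch_normalize(text: str) -> str:
    text = _C1.sub(_cp1252, text)                   # 1. Windows-1252
    text = unicodedata.normalize("NFKC", text)      # 2. NFKC (covers 8 and 10)
    text = _INVISIBLE.sub("", text)                 # 3. directional, zero-width
    text = "".join(                                 # 4. control characters
        c for c in text
        if c in "\n\t" or unicodedata.category(c) != "Cc"
    )
    text = re.sub(r"[^\S\n\t]", " ", text)          # 5. special spaces
    text = text.translate(_PUNCT)                   # 6, 7, 9. punctuation
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)    # 11. line-break hyphens
    return re.sub(r" {2,}", " ", text)              # 12. collapse spaces


print(sketch_normalize("\x93intro-\nduction\x94\u00a0\u2013  \ufb01le\u200b"))
```

Order matters in this sketch: the Windows-1252 pass runs before control characters are stripped, because the stray `\x93`-style code points fall in the C1 range and would otherwise be deleted instead of converted. Note also that the line-break rule joins any word split by a hyphen and newline, so a genuine compound such as `well-\nknown` would lose its hyphen.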

Based On

text_toolz library patterns
extractor pipeline s02 normalization
Unicode Normalization Form KC (NFKC), defined in Unicode Standard Annex #15

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf
