npx skills add grahama1970/agent-skills --skill "normalize"
Install specific skill from multi-skill repository
# SKILL.md
name: normalize
description: >
  Normalize text to handle PDF/Unicode encoding issues.
  Converts Windows-1252, curly quotes, em/en dashes, ligatures,
  directional formatting, zero-width chars, and more to clean ASCII.
allowed-tools: Bash, Read
triggers:
  - normalize text
  - clean text
  - normalize unicode
  - fix encoding
  - clean pdf text
  - normalize pdf
metadata:
  short-description: Clean PDF/Unicode text to ASCII
  project-path: /home/graham/workspace/experiments/pi-mono
Text Normalize
Comprehensive text normalization for handling PDF and Unicode encoding issues.
Quick Start
# Normalize text from stdin (bash's printf expands \u escapes; plain echo does not)
printf 'Hello\u2019world\n' | .pi/skills/normalize/run.sh
# Normalize a file
.pi/skills/normalize/run.sh document.txt
# Normalize with output file
.pi/skills/normalize/run.sh document.txt -o clean.txt
# Treat argument as text (not filename); $'...' makes bash expand the \u escapes
.pi/skills/normalize/run.sh -t $'Hello\u201cworld\u201d'
# Show statistics
.pi/skills/normalize/run.sh document.txt --stats
What It Normalizes
| Category | Examples | Normalized To |
|---|---|---|
| Whitespace | Non-breaking, em/en space, hair space | Regular space |
| Hyphens | En dash, em dash, minus sign, figure dash | ASCII hyphen - |
| Quotes | Curly quotes, guillemets, primes | Straight ' and " |
| Windows-1252 | \x93, \x94, \x92 | ", ", ' |
| Ligatures | fi, fl, ffi, ffl | Expanded letters |
| Bullets | Various bullet points | Hyphen - |
| Zero-width | ZWSP, ZWNJ, ZWJ, BOM | Removed |
| Directional | LTR/RTL marks | Removed |
| Control chars | C0/C1 (except newline/tab) | Removed |
| Line breaks | intro-\nduction | introduction |
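Several rows of the table are plain character-for-character substitutions, which can be sketched with `str.translate`. The mapping below is a hypothetical subset written for illustration (the names `CHAR_MAP` and `map_chars` are assumptions); it is not the table the skill actually ships.

```python
# Hypothetical subset of the table above, NOT the skill's real mapping.
CHAR_MAP = {
    "\u00a0": " ",    # non-breaking space
    "\u2002": " ",    # en space
    "\u2003": " ",    # em space
    "\u200a": " ",    # hair space
    "\u2012": "-",    # figure dash
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2212": "-",    # minus sign
    "\u2018": "'",    # left single curly quote
    "\u2019": "'",    # right single curly quote
    "\u201c": '"',    # left double curly quote
    "\u201d": '"',    # right double curly quote
    "\u00ab": '"',    # left guillemet
    "\u00bb": '"',    # right guillemet
    "\u2022": "-",    # bullet
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
    "\u200b": None,   # zero-width space (removed)
    "\u200c": None,   # zero-width non-joiner (removed)
    "\u200d": None,   # zero-width joiner (removed)
    "\ufeff": None,   # byte order mark (removed)
}

# str.maketrans accepts single-character keys; values may be strings or None.
TRANSLATION = str.maketrans(CHAR_MAP)


def map_chars(text: str) -> str:
    """Apply the one-pass character substitutions."""
    return text.translate(TRANSLATION)


print(map_chars("\u201cfoo\u201d\u00a0\u2014\u00a0bar\u200b"))  # "foo" - bar
```

A single translation table keeps the substitution pass linear in the input length, and each new character needs only one more entry.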
Pipeline Integration
This skill is based on the same normalization used in the extractor pipeline's
s02_marker_extractor.py. The code is kept in sync with text_toolz patterns.
Python Usage
from normalize import normalize_text
# Clean text for pattern matching
text = "1.\u00a0Introduction" # Non-breaking space
clean = normalize_text(text) # "1. Introduction"
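To see why this matters for pattern matching, here is a self-contained sketch that does not import the skill. It uses the standard library's `unicodedata.normalize("NFKC", ...)` as a stand-in for `normalize_text`, on the assumption that the skill folds U+00A0 to a plain space in the same way; the `HEADING` pattern is a made-up example.

```python
import re
import unicodedata

# Hypothetical pattern for a numbered heading such as "1. Introduction".
HEADING = re.compile(r"^(\d+)\. (\w+)")

raw = "1.\u00a0Introduction"  # non-breaking space after the period

# Stand-in for normalize_text: NFKC alone folds U+00A0 to U+0020.
clean = unicodedata.normalize("NFKC", raw)

print(HEADING.match(raw))             # None: a literal space does not match U+00A0
print(HEADING.match(clean).groups())  # ('1', 'Introduction')
```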
Normalization Steps
- Windows-1252 conversion - Handle legacy MS Office encoding
- NFKC normalization - Unicode compatibility decomposition followed by canonical composition
- Remove directional formatting - LTR/RTL marks
- Remove control characters - C0/C1 (preserve newlines and tabs)
- Normalize whitespace - All special spaces to ASCII
- Normalize hyphens - All dash variants to -
- Normalize quotes - Curly to straight
- Normalize dots - Ellipsis, leader dots
- Normalize bullets - All bullet types to -
- Expand ligatures - fi/fl/ffi/ffl
- Fix line-break hyphens - Join hyphenated words
- Collapse whitespace - Multiple spaces to single
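The ordering of the steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the skill's implementation: `normalize_sketch`, `C1_TO_ASCII`, `DIRECTIONAL`, and `PUNCT` are hypothetical names, only a few characters per category are covered, and NFKC is relied on for ligature expansion and special spaces.

```python
import re
import unicodedata

# Windows-1252 punctuation bytes that were mis-decoded as C1 control characters.
C1_TO_ASCII = str.maketrans({
    "\x91": "'", "\x92": "'",   # single curly quotes
    "\x93": '"', "\x94": '"',   # double curly quotes
    "\x96": "-", "\x97": "-",   # en dash, em dash
})

# LRM/RLM marks, embeddings/overrides, and isolates.
DIRECTIONAL = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")

# Characters NFKC leaves alone but that should become ASCII (illustrative subset).
PUNCT = str.maketrans({
    "\u2013": "-", "\u2014": "-", "\u2212": "-",   # dashes and minus sign
    "\u2018": "'", "\u2019": "'",                  # single curly quotes
    "\u201c": '"', "\u201d": '"',                  # double curly quotes
    "\u2022": "-",                                 # bullet
})


def normalize_sketch(text: str) -> str:
    # Windows-1252 conversion must run before control characters are stripped.
    text = text.translate(C1_TO_ASCII)
    # NFKC expands ligatures and folds most special spaces to U+0020.
    text = unicodedata.normalize("NFKC", text)
    # Remove directional formatting marks.
    text = DIRECTIONAL.sub("", text)
    # Remove C0/C1 controls (category Cc), keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Dashes, quotes, and bullets to ASCII.
    text = text.translate(PUNCT)
    # Join words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "intro-\nduction to \ufb01le\u00a0types \x93quoted\x94 \u2013 done\u200e"
print(normalize_sketch(sample))  # introduction to file types "quoted" - done
```

Two caveats with this sketch: NFKC also rewrites characters outside the list above (superscript digits become plain digits, for example), and the line-break rule will also merge a genuine compound such as "well-\nknown" into "wellknown".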
Based On
- text_toolz library patterns
- extractor pipeline s02 normalization
- Unicode Normalization Form KC (NFKC)