speak-tts

by @emzod in AI & LLM

# Install this skill:

npx skills add EmZod/speak

Or install specific skill: npx add-skill https://github.com/EmZod/speak

# Description

Give your agent the ability to speak to you in real time. Talk to your Claude! Local TTS, text-to-speech, voice synthesis, audio generation with voice cloning on Apple Silicon. Use for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX: private, no API keys.

# SKILL.md

name: speak-tts
description: Give your agent the ability to speak to you in real time. Talk to your Claude! Local TTS, text-to-speech, voice synthesis, audio generation with voice cloning on Apple Silicon. Use for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX: private, no API keys.

speak - Talk to your Claude!

Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.

Prerequisites

Requirement	Check	Install
Apple Silicon Mac	`uname -m` → arm64	Intel not supported
macOS 12.0+	`sw_vers`	-
sox	`which sox`	`brew install sox`
ffmpeg	`which ffmpeg`	`brew install ffmpeg`
poppler (PDF)	`which pdftotext`	`brew install poppler`
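
The checks in the table above can be run in one pass. The following is a minimal sketch, not part of speak itself; the function names `check_tool` and `check_prereqs` are hypothetical and introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report whether a command is on PATH.
# $1 = command name, $2 = Homebrew package that provides it.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1 (try: brew install $2)"
    return 1
  fi
}

# Hypothetical helper: check the architecture and each tool from the table.
check_prereqs() {
  status=0
  if [ "$(uname -m)" != "arm64" ]; then
    echo "unsupported architecture: $(uname -m)"
    status=1
  fi
  check_tool sox sox || status=1
  check_tool ffmpeg ffmpeg || status=1
  check_tool pdftotext poppler || status=1
  return "$status"
}
```

Calling `check_prereqs` prints one line per requirement and returns non-zero if anything is missing.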

Input Sources

Source	Example
Text file	`speak article.txt`
Markdown	`speak doc.md`
Direct string	`speak "Hello"`
Clipboard	`pbpaste | speak`
Stdin	`cat file.txt | speak`

Web Articles

lynx -dump -nolist "https://example.com/article" | speak --output article.wav

Converting Formats

Format	Convert Command
PDF	`pdftotext doc.pdf doc.txt`
DOCX	`textutil -convert txt doc.docx`
HTML	`pandoc -f html -t plain doc.html > doc.txt`
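
If you convert many files, the table above can be wrapped in a small dispatcher. This is a sketch under assumptions: `to_text` is a hypothetical name, and it relies on `textutil` writing its output next to the source file with a `.txt` extension, which is its default behavior on macOS.

```shell
#!/bin/sh
# Hypothetical helper: print the path of a plain-text version of $1,
# converting with the commands from the table when needed.
to_text() {
  src=$1
  out="${src%.*}.txt"
  case "$src" in
    *.txt|*.md) echo "$src"; return 0 ;;            # already readable by speak
    *.pdf)  pdftotext "$src" "$out" ;;
    *.docx) textutil -convert txt "$src" ;;         # writes <name>.txt beside the source
    *.html) pandoc -f html -t plain "$src" > "$out" ;;
    *) echo "unsupported format: $src" >&2; return 1 ;;
  esac && echo "$out"
}
```

For example, `speak "$(to_text doc.pdf)" --stream` converts the PDF and then reads the resulting text file.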

Output Modes

Goal	Command
Save for later	`speak text.txt --output file.wav`
Listen now (streaming)	`speak text.txt --stream`
Listen now (complete)	`speak text.txt --play`
Both	`speak text.txt --stream --output file.wav`

Default Behavior

speak article.txt          # → ~/Audio/speak/article.wav (no playback)
speak "Hello"              # → ~/Audio/speak/speak_<timestamp>.wav

Directory Auto-Creation

Directory	Auto-Created?
`~/Audio/speak/`	✓ Yes
`~/.chatter/voices/`	✗ No
Custom directories	✗ No

Always create custom directories first:

mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/

Voice Cloning

Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.

Quality Expectations

- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)

Recording Your Voice

Using QuickTime:
1. Open QuickTime Player → File → New Audio Recording
2. Record 20 seconds of clear speech
3. File → Export As → Audio Only (.m4a)
4. Convert to WAV (see below)

Using sox (command line):

# The voices directory is not auto-created, so create it first
mkdir -p ~/.chatter/voices/

# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25

Converting to Required Format

Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.

# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav

# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav

# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav

# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
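
The manual `ffprobe` check above can also be scripted. The sketch below is an illustration, not speak's own validation: the function names are hypothetical, and the limits are taken from the stated requirements (24000 Hz, mono, 10-30 seconds).

```shell
#!/bin/sh
# Hypothetical helper: validate sample properties.
# $1 = sample rate in Hz, $2 = channel count, $3 = duration in seconds.
check_sample_specs() {
  rate=$1
  channels=$2
  duration=${3%.*}   # drop the fractional part, so 30.9 is treated as 30
  [ "$rate" -eq 24000 ] || { echo "sample rate is $rate Hz, expected 24000"; return 1; }
  [ "$channels" -eq 1 ] || { echo "$channels channels, expected mono"; return 1; }
  [ "$duration" -ge 10 ] && [ "$duration" -le 30 ] ||
    { echo "duration ${duration}s is outside 10-30s"; return 1; }
  echo "ok"
}

# Read one field from the first audio stream (or the container) with ffprobe.
probe() {
  ffprobe -v error -select_streams a:0 -show_entries "$2" \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Hypothetical helper: validate a WAV file on disk.
check_sample_file() {
  check_sample_specs "$(probe "$1" stream=sample_rate)" \
    "$(probe "$1" stream=channels)" "$(probe "$1" format=duration)"
}
```

Running `check_sample_file voice.wav` prints `ok` or the first property that does not match.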

Using Your Voice

# Create directory
mkdir -p ~/.chatter/voices/

# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav

# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream

# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav

Path requirements:
- ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✓ Works: /Users/name/.chatter/voices/my_voice.wav
- ✗ Fails: my_voice.wav (relative path)
- ✗ Fails: ./voices/my_voice.wav (relative path)
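
Because relative paths fail, it can help to resolve a path before passing it to `--voice`. A minimal sketch, assuming the sample's directory already exists; `abs_path` is a hypothetical helper, not a speak command.

```shell
#!/bin/sh
# Hypothetical helper: print the absolute form of a (possibly relative) file path.
abs_path() {
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  echo "$dir/$(basename "$1")"
}
```

For example: `speak "Testing" --voice "$(abs_path ./voices/my_voice.wav)" --stream`.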

Voice Sample Tips

Good Sample	Bad Sample
Quiet room	Background noise
Natural pace	Rushed or monotone
Clear diction	Mumbling
Varied content	Repetitive phrases

Default Voice

When --voice is omitted, a built-in default voice is used:

speak "Hello world" --stream  # Uses default voice

Emotion Tags

Tags produce audible effects (actual sounds), not spoken words:

speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."

Tag	Effect
`[laugh]`	Laughter
`[chuckle]`	Light chuckle
`[sigh]`	Sighing
`[gasp]`	Gasping
`[groan]`	Groaning
`[clear throat]`	Throat clearing
`[cough]`	Coughing
`[crying]`	Crying
`[singing]`	Sung speech

Not supported: [pause] and [whisper] are ignored.

For pauses, use punctuation: "Wait... let me think."
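
Since unsupported tags are silently ignored, a quick scan of a script before generation can catch them. This is a sketch, not a speak feature: `unsupported_tags` is a hypothetical name, and the allow-list is copied from the tag table above.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print every bracketed tag
# that is not in the supported list. Prints nothing if all tags are supported.
unsupported_tags() {
  grep -oE '\[[a-z ]+\]' | sort -u |
    grep -vxE '\[(laugh|chuckle|sigh|gasp|groan|clear throat|cough|crying|singing)\]'
}
```

For example, `unsupported_tags < script.txt` lists tags such as `[pause]` that would have no audible effect.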

Batch Processing

mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav

# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk

# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing

Auto-Chunk Behavior

When using --auto-chunk with batch processing:
1. Each input file is chunked independently
2. Chunks are generated and automatically concatenated per file
3. Final output: one .wav per input file (e.g., ch01.wav)
4. Intermediate chunks deleted (unless --keep-chunks)

You don't need to manually concatenate chunks — only concatenate final chapter files.

Concatenating Audio

# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav

# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav

Zero-Padding Rules

Critical for correct concatenation order:

Files	Correct	Wrong
1-9	`01`, `02`, ..., `09`	`1`, `2`, ..., `9`
10-99	`01`, `02`, ..., `99`	`1`, `10`, `2`, ...
100+	`001`, `002`, ..., `999`	`1`, `100`, `2`, ...

Why: Shell glob expansion sorts names lexicographically, not numerically. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
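
If files were already generated with unpadded names, they can be renamed in place. The following is a minimal sketch, assuming files are named `ch<N>.<ext>`; `zero_pad_chapters` is a hypothetical helper and not part of speak.

```shell
#!/bin/sh
# Hypothetical helper: rename ch<N>.<ext> files in a directory so that N is
# zero-padded. $1 = directory, $2 = pad width (defaults to 2).
zero_pad_chapters() {
  dir=$1
  width=${2:-2}
  for f in "$dir"/ch*.*; do
    [ -e "$f" ] || continue
    name=$(basename "$f")
    num=${name#ch}          # "7.wav"
    num=${num%%.*}          # "7"
    ext=${name#*.}          # "wav"
    case "$num" in
      ''|*[!0-9]*) continue ;;   # skip names without a plain number
    esac
    num=$(echo "$num" | sed 's/^0*//')   # avoid octal parsing of "08"
    new=$(printf "ch%0${width}d.%s" "${num:-0}" "$ext")
    # Skip when the target exists (including when the name is already padded).
    [ -e "$dir/$new" ] || mv "$f" "$dir/$new"
  done
}
```

Use a width of 3 for 100 or more files, for example `zero_pad_chapters audiobook 3`.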

PDF to Audiobook (Complete Workflow)

Step 1: Find Chapter Boundaries

# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt  # Note chapter page numbers

# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"

Step 2: Extract Chapters (Zero-Padded!)

# For a 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
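
For books with many chapters, the per-chapter commands can be generated from the list of chapter start pages found in Step 1. A sketch under assumptions: `chapter_commands` is a hypothetical helper that only prints the `pdftotext` commands, so you can review them before running anything.

```shell
#!/bin/sh
# Hypothetical helper: print one pdftotext command per chapter.
# Usage: chapter_commands <pdf> <last_page> <start1> <start2> ...
chapter_commands() {
  pdf=$1
  last=$2
  shift 2
  n=1
  while [ $# -gt 0 ]; do
    first=$1
    shift
    # A chapter ends one page before the next one starts; the final chapter
    # runs to the last page of the book.
    if [ $# -gt 0 ]; then end=$(( $1 - 1 )); else end=$last; fi
    printf 'pdftotext -f %d -l %d -layout %s ch%02d.txt\n' "$first" "$end" "$pdf" "$n"
    n=$(( n + 1 ))
  done
}
```

To execute the printed commands, pipe them to the shell: `chapter_commands textbook.pdf 100 1 13 26 | sh`. File names containing spaces would need quoting that this sketch does not add.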

Step 3: Estimate Time

speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed

# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
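
The quick estimates above are simple multiplication, which the sketch below reproduces. `estimate_pages` is a hypothetical helper; it uses the rule-of-thumb figures from this guide (2 minutes of audio and 1 minute of generation per page, about 2.5 MB per minute), so treat the result as a rough guide and prefer `speak --estimate` for real numbers.

```shell
#!/bin/sh
# Hypothetical helper: rough audio, generation, and storage estimate for a page count.
estimate_pages() {
  pages=$1
  audio_min=$(( pages * 2 ))           # 1 page is roughly 2 min of audio
  gen_min=$(( pages * 1 ))             # 1 page is roughly 1 min of generation
  size_mb=$(( audio_min * 5 / 2 ))     # roughly 2.5 MB per minute of audio
  echo "$audio_min min audio, $gen_min min generation, ~$size_mb MB"
}
```

For example, `estimate_pages 100` prints `200 min audio, 100 min generation, ~500 MB`.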

Step 4: Generate Audio

mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav

Step 5: Concatenate

speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav

PDF Troubleshooting

Issue	Solution
Empty/garbled text	Scanned PDF — use OCR: `brew install tesseract`
Wrong encoding	Try: `pdftotext -enc UTF-8 doc.pdf`
Check word count	`pdftotext doc.pdf - | wc -w` (should be >100)

Multi-Voice Content

mkdir -p podcast/scripts podcast/wav

echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt

speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav

speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
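
With more than a few lines of dialogue, the per-line commands can be derived from the file names. This is a sketch that assumes a naming convention of `<NN>_<role>.txt` and a matching voice sample at `~/.chatter/voices/<role>.wav`; both the convention and the `voice_commands` helper are hypothetical. It only prints commands so you can check them first.

```shell
#!/bin/sh
# Hypothetical helper: print one speak command per script file.
# $1 = directory of <NN>_<role>.txt scripts, $2 = output directory for WAV files.
voice_commands() {
  for script in "$1"/*.txt; do
    [ -e "$script" ] || continue
    base=$(basename "$script" .txt)   # e.g. 01_host
    num=${base%%_*}                   # 01
    role=${base#*_}                   # host
    printf 'speak %s --voice ~/.chatter/voices/%s.wav --output %s/%s.wav\n' \
      "$script" "$role" "$2" "$num"
  done
}
```

Run the printed commands with `voice_commands podcast/scripts podcast/wav | sh`, then concatenate the numbered WAV files as shown above.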

Options Reference

Option	Description	Default
`--stream`	Stream as it generates	false
`--play`	Play after complete	false
`--output <path>`	Output file	~/Audio/speak/
`--output-dir <dir>`	Batch output directory	-
`--voice <path>`	Voice sample (full path)	default
`--timeout <sec>`	Timeout per file	300
`--auto-chunk`	Split long documents	false
`--chunk-size <n>`	Chars per chunk	6000
`--resume <file>`	Resume from manifest	-
`--keep-chunks`	Keep intermediate files	false
`--skip-existing`	Skip if output exists	false
`--estimate`	Show duration estimate	false
`--dry-run`	Preview only	false
`--quiet`	Suppress output	false

Commands

Command	Description
`speak setup`	Set up environment
`speak health`	Check system status
`speak models`	List TTS models
`speak concat`	Concatenate audio
`speak daemon kill`	Stop TTS server
`speak config`	Show configuration

Performance

Metric	Value
Cold start	~4-8s
Warm start	~3-8s
Speed	0.3-0.5x RTF (faster than real-time)
Storage	~2.5 MB/min, ~150 MB/hour

Resume Capability

For interrupted long generations:

# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json

# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
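
For unattended runs, the re-run step can be automated with a small retry wrapper. This sketch assumes that `speak` exits with a non-zero status when a batch is interrupted or fails, which this guide does not confirm; `retry` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: re-run a command until it succeeds or the attempt
# limit is reached. $1 = maximum attempts, remaining arguments = the command.
retry() {
  max=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$(( attempt + 1 ))
    sleep 1   # brief pause before the next attempt
  done
}
```

For example: `retry 3 speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing`. Because of `--skip-existing`, each attempt only regenerates the files that are still missing.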

Common Errors

Error	Cause	Solution
"Voice file not found"	Relative path	Use full path: `~/.chatter/voices/x.wav`
"Invalid WAV format"	Wrong specs	Convert: `ffmpeg -i in.wav -ar 24000 -ac 1 out.wav`
"Voice sample too short"	<10 seconds	Record 15-25 seconds
"Output directory doesn't exist"	Not created	`mkdir -p dirname/`
"sox not found"	Not installed	`brew install sox`
Scrambled concat order	Non-zero-padded	Use `01`, `02`, not `1`, `2`
Timeout	>5 min generation	Use `--auto-chunk` or `--timeout 600`
"Server not running"	Stale daemon	`speak daemon kill && speak health`

Setup

speak "test"     # Auto-setup on first run (downloads model ~500MB)
speak setup      # Or manual setup
speak health     # Verify everything works

Server Management

The server starts automatically and shuts down after 1 hour of idle time.

speak health        # Check status
speak daemon kill   # Stop manually

# README.md

                          ███████╗██████╗ ███████╗ █████╗ ██╗  ██╗
                          ██╔════╝██╔══██╗██╔════╝██╔══██╗██║ ██╔╝
                          ███████╗██████╔╝█████╗  ███████║█████╔╝ 
                          ╚════██║██╔═══╝ ██╔══╝  ██╔══██║██╔═██╗ 
                          ███████║██║     ███████╗██║  ██║██║  ██╗
                          ╚══════╝╚═╝     ╚══════╝╚═╝  ╚═╝╚═╝  ╚═╝

Talk to your Claude.

Voice cloning. Long documents. Audiobook quality. Local & private.

speak article.md --stream → Audio starts in seconds

Install

For AI Agents (Claude Code, Cursor, Windsurf):

npx skills add EmZod/speak

CLI:

git clone https://github.com/EmZod/speak.git
cd speak && bun install
alias speak="bun run '$(pwd)/src/index.ts'"   # quoted so paths with spaces work

Requirements: macOS Apple Silicon · Bun · Python 3.10+ · sox (brew install sox)

Usage

speak "Hello, world!" --play        # Generate and play
speak article.md --stream           # Stream long content  
speak document.md --output out.wav  # Save to file
speak --clipboard --play            # Read from clipboard

Voice Cloning

Clone any voice from a 10-30 second sample:

# Use your cloned voice
speak "Hello" --voice ~/.chatter/voices/morgan_freeman.wav --play

Long Documents

speak book.md --auto-chunk --output book.wav    # Auto-chunk for reliability
speak --resume manifest.json                     # Resume interrupted generation
speak *.md --output-dir ~/Audio/                 # Batch processing
speak --estimate document.md                     # Estimate duration first

Commands

speak <text|file>      Generate speech
speak health           Check system status
speak models           List available models
speak concat <files>   Combine audio files
speak daemon kill      Stop TTS server

Options

--play          Play after generation
--stream        Stream as it generates
--output        Output file or directory
--voice         Custom voice file (WAV)
--auto-chunk    Chunk long documents
--estimate      Show duration estimate
--dry-run       Preview without generating

Performance

Long documents     ████████████████████  Streaming, auto-chunk
Voice cloning      ████████████████████  Any voice from sample
Emotion tags       ████████████████████  [laugh], [sigh], etc.
Quality            ████████████████████  Audiobook grade

Documentation

File	Content
SKILL.md	Full usage guide for agents
docs/usage.md	Complete CLI reference
docs/troubleshooting.md	Common issues & fixes
AGENTS.md	Architecture & development

MIT License · Built on Chatterbox TTS

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

