npx skills add EmZod/speak
Or install specific skill: npx add-skill https://github.com/EmZod/speak
# Description
Give your agent the ability to speak to you real-time. Talk to your Claude! Local TTS, text-to-speech, voice synthesis, audio generation with voice cloning on Apple Silicon. Use for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX - private, no API keys.
# SKILL.md
name: speak-tts
description: Give your agent the ability to speak to you real-time. Talk to your Claude! Local TTS, text-to-speech, voice synthesis, audio generation with voice cloning on Apple Silicon. Use for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX - private, no API keys.
speak - Talk to your Claude!
Give your agent the ability to speak to you real-time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.
Prerequisites
| Requirement | Check | Install |
|---|---|---|
| Apple Silicon Mac | uname -m → arm64 | Intel not supported |
| macOS 12.0+ | sw_vers | - |
| sox | which sox | brew install sox |
| ffmpeg | which ffmpeg | brew install ffmpeg |
| poppler (PDF) | which pdftotext | brew install poppler |
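The checks in the table can be bundled into a small preflight script. This is only a sketch and not part of speak itself; the helper name `missing_tools` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of speak): print each named command
# that is not found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
    return 0
}

# Architecture check from the table: Apple Silicon reports arm64.
if [ "$(uname -m)" != "arm64" ]; then
    echo "warning: not an arm64 machine; speak requires Apple Silicon" >&2
fi

# Tools from the table; pdftotext is provided by poppler.
missing=$(missing_tools sox ffmpeg pdftotext)
if [ -n "$missing" ]; then
    # Unquoted on purpose: one message per missing tool.
    printf 'missing tool: %s\n' $missing >&2
fi
```

Anything reported as missing can be installed with the matching `brew install` command from the table.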
Input Sources
| Source | Example |
|---|---|
| Text file | speak article.txt |
| Markdown | speak doc.md |
| Direct string | speak "Hello" |
| Clipboard | pbpaste \| speak |
| Stdin | cat file.txt \| speak |
Web Articles
lynx -dump -nolist "https://example.com/article" | speak --output article.wav
Converting Formats
| Format | Convert Command |
|---|---|
| PDF | pdftotext doc.pdf doc.txt |
| DOCX | textutil -convert txt doc.docx |
| HTML | pandoc -f html -t plain doc.html > doc.txt |
Output Modes
| Goal | Command |
|---|---|
| Save for later | speak text.txt --output file.wav |
| Listen now (streaming) | speak text.txt --stream |
| Listen now (complete) | speak text.txt --play |
| Both | speak text.txt --stream --output file.wav |
Default Behavior
speak article.txt # → ~/Audio/speak/article.wav (no playback)
speak "Hello" # → ~/Audio/speak/speak_<timestamp>.wav
Directory Auto-Creation
| Directory | Auto-Created? |
|---|---|
| ~/Audio/speak/ | ✓ Yes |
| ~/.chatter/voices/ | ✗ No |
| Custom directories | ✗ No |
Always create custom directories first:
mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
Voice Cloning
Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.
Quality Expectations
- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)
Recording Your Voice
Using QuickTime:
1. Open QuickTime Player → File → New Audio Recording
2. Record 20 seconds of clear speech
3. File → Export As → Audio Only (.m4a)
4. Convert to WAV (see below)
Using sox (command line):
# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25
Converting to Required Format
Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.
# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav
# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav
# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav
# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
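The format requirements above (WAV, 24000 Hz, mono, 10-30 seconds) can also be checked in a script instead of by eye. The sketch below is an illustration, not something speak provides: `check_voice_specs` and `probe_voice` are hypothetical names, and `probe_voice` assumes `ffprobe` (installed with ffmpeg) is on PATH.

```shell
#!/bin/sh
# Hypothetical helper: validate sample rate (Hz), channel count, and
# duration (seconds, may be fractional) against the stated requirements.
# Prints "ok" and returns 0, or prints one line per problem and returns 1.
check_voice_specs() {
    rate=$1 channels=$2 duration=$3
    problems=$(
        [ "$rate" = "24000" ] || echo "sample rate is $rate Hz, expected 24000"
        [ "$channels" = "1" ] || echo "channels is $channels, expected 1 (mono)"
        # awk handles the fractional comparison that sh arithmetic cannot.
        awk -v d="$duration" 'BEGIN {
            if (d < 10) print "duration " d "s is under the 10s minimum"
            else if (d > 30) print "duration " d "s is over the 30s maximum"
        }'
    )
    if [ -n "$problems" ]; then
        printf '%s\n' "$problems"
        return 1
    fi
    echo "ok"
}

# Hypothetical wrapper: read the three values from a file with ffprobe,
# then validate them.
probe_voice() {
    f=$1
    p_rate=$(ffprobe -v error -select_streams a:0 \
        -show_entries stream=sample_rate \
        -of default=noprint_wrappers=1:nokey=1 "$f")
    p_channels=$(ffprobe -v error -select_streams a:0 \
        -show_entries stream=channels \
        -of default=noprint_wrappers=1:nokey=1 "$f")
    p_duration=$(ffprobe -v error -show_entries format=duration \
        -of default=noprint_wrappers=1:nokey=1 "$f")
    check_voice_specs "$p_rate" "$p_channels" "$p_duration"
}
```

For example, `probe_voice voice.wav` prints `ok` for a conforming sample and otherwise lists what to fix before running the ffmpeg conversion again.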
Using Your Voice
# Create directory
mkdir -p ~/.chatter/voices/
# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav
# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream
# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav
Path requirements:
- ✓ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✓ Works: /Users/name/.chatter/voices/my_voice.wav
- ✗ Fails: my_voice.wav (relative path)
- ✗ Fails: ./voices/my_voice.wav (relative path)
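Since relative paths fail, a script that takes a voice path from the user may want to normalise it first. The following is a minimal sketch; `abs_path` is a hypothetical helper, and it neither resolves symlinks nor collapses `..` segments.

```shell
#!/bin/sh
# Hypothetical helper: turn a possibly relative path into an absolute one
# so it can be passed to --voice.
abs_path() {
    case "$1" in
        /*)
            # Already absolute.
            printf '%s\n' "$1" ;;
        '~/'*)
            # A literal tilde (for example from a quoted string) is not
            # expanded by the shell, so expand it here.
            rest=${1#'~/'}
            printf '%s/%s\n' "$HOME" "$rest" ;;
        ./*)
            rest=${1#./}
            printf '%s/%s\n' "$PWD" "$rest" ;;
        *)
            printf '%s/%s\n' "$PWD" "$1" ;;
    esac
}
```

Usage might look like `speak "Testing my voice" --voice "$(abs_path ./voices/my_voice.wav)" --stream`.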
Voice Sample Tips
| Good Sample | Bad Sample |
|---|---|
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |
Default Voice
When --voice is omitted, a built-in default voice is used:
speak "Hello world" --stream # Uses default voice
Emotion Tags
Tags produce audible effects (actual sounds), not spoken words:
speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."
| Tag | Effect |
|---|---|
| [laugh] | Laughter |
| [chuckle] | Light chuckle |
| [sigh] | Sighing |
| [gasp] | Gasping |
| [groan] | Groaning |
| [clear throat] | Throat clearing |
| [cough] | Coughing |
| [crying] | Crying |
| [singing] | Sung speech |
NOT supported: [pause] and [whisper]; both are ignored.
For pauses, use punctuation instead: "Wait... let me think."
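Because unsupported tags are silently ignored, it can help to scan a script for them before generating audio. This sketch is not a speak feature: `unsupported_tags` is a hypothetical helper that lists every bracketed token not in the supported-tag table. It will also flag ordinary bracketed text such as `[1]`.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print each distinct
# bracketed tag that is not in the supported list.
unsupported_tags() {
    grep -o '\[[^]]*\]' \
      | grep -v -x -F \
            -e '[laugh]' -e '[chuckle]' -e '[sigh]' -e '[gasp]' \
            -e '[groan]' -e '[clear throat]' -e '[cough]' \
            -e '[crying]' -e '[singing]' \
      | sort -u
    # Finding nothing is not an error.
    return 0
}
```

For example, `unsupported_tags < article.txt` prints nothing when every tag in the file is supported.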
Batch Processing
mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav
# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk
# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing
Auto-Chunk Behavior
When using --auto-chunk with batch processing:
1. Each input file is chunked independently
2. Chunks are generated and automatically concatenated per file
3. Final output: one .wav per input file (e.g., ch01.wav)
4. Intermediate chunks deleted (unless --keep-chunks)
You don't need to manually concatenate chunks — only concatenate final chapter files.
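To get a feel for how many chunks a file will produce, the character count can be divided by the chunk size. This is a rough estimate only: it uses the 6000-character default of --chunk-size from the options table and assumes nothing about where the tool actually places chunk boundaries. `estimate_chunks` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: estimate the number of chunks for a text file.
# Usage: estimate_chunks FILE [CHUNK_SIZE]
estimate_chunks() {
    file=$1
    chunk_size=${2:-6000}                        # default from --chunk-size
    # wc may pad its output with spaces, so strip them.
    chars=$(wc -m < "$file" | tr -d ' ')
    # Ceiling division: a partial chunk still counts as one chunk.
    echo $(( (chars + chunk_size - 1) / chunk_size ))
}
```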
Concatenating Audio
# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav
# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav
Zero-Padding Rules
Critical for correct concatenation order:
| Files | Correct | Wrong |
|---|---|---|
| 1-9 | 01, 02, ..., 09 | 1, 2, ..., 9 |
| 10-99 | 01, 02, ..., 99 | 1, 10, 2, ... |
| 100+ | 001, 002, ..., 999 | 1, 100, 2, ... |
Why: Shell glob expansion sorts names lexicographically, character by character. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
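The ordering problem is easy to reproduce with `sort`, and `printf` can generate padded names. A small sketch follows; `padded_name` is a hypothetical helper, not a speak command.

```shell
#!/bin/sh
# Hypothetical helper: build a zero-padded file name.
# Usage: padded_name PREFIX NUMBER WIDTH EXT
padded_name() {
    prefix=$1 number=$2 width=$3 ext=$4
    # Strip leading zeros so a value like 08 is not parsed as octal.
    number=$(printf '%s' "$number" | sed 's/^0*//')
    printf "%s%0${width}d.%s\n" "$prefix" "${number:-0}" "$ext"
}

# Unpadded names sort in the wrong order:
printf '%s\n' ch1.wav ch10.wav ch2.wav | LC_ALL=C sort
# ch1.wav
# ch10.wav
# ch2.wav

# Padded names sort in reading order:
printf '%s\n' ch01.wav ch10.wav ch02.wav | LC_ALL=C sort
# ch01.wav
# ch02.wav
# ch10.wav

padded_name ch 7 2 wav    # ch07.wav
```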
PDF to Audiobook (Complete Workflow)
Step 1: Find Chapter Boundaries
# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt # Note chapter page numbers
# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"
Step 2: Extract Chapters (Zero-Padded!)
# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
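For books with many chapters, the extraction commands can be generated from a list of page ranges instead of typed by hand. The sketch below prints the commands rather than running them, so they can be reviewed first. `chapter_commands` is a hypothetical helper, and the page ranges are the illustrative ones from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print one pdftotext command per chapter.
# Usage: chapter_commands PDF FIRST-LAST [FIRST-LAST ...]
# Output names are zero-padded to two digits (ch01.txt, ch02.txt, ...).
chapter_commands() {
    pdf=$1
    shift
    n=1
    for range in "$@"; do
        first=${range%-*}     # text before the dash
        last=${range#*-}      # text after the dash
        printf 'pdftotext -f %s -l %s -layout %s ch%02d.txt\n' \
            "$first" "$last" "$pdf" "$n"
        n=$((n + 1))
    done
}

chapter_commands textbook.pdf 1-12 13-25 26-38
```

Once the printed commands look right, they can be piped to `sh` to run them. The sketch does not quote the PDF name, so it assumes a file name without spaces.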
Step 3: Estimate Time
speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed
# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
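The quick estimates are simple arithmetic and can be reproduced for any page count. This sketch applies the rule-of-thumb figures above together with the roughly 2.5 MB per audio minute from the Performance table. `quick_estimate` is a hypothetical helper, and actual numbers will vary with page density and hardware.

```shell
#!/bin/sh
# Hypothetical helper: rule-of-thumb estimate from a page count.
#   1 page ~ 2 min audio ~ 1 min generation, ~2.5 MB per audio minute.
quick_estimate() {
    pages=$1
    audio_min=$((pages * 2))
    gen_min=$((pages * 1))
    storage_mb=$((audio_min * 25 / 10))   # 2.5 MB/min in integer arithmetic
    echo "pages=$pages audio_min=$audio_min generation_min=$gen_min storage_mb=$storage_mb"
}

quick_estimate 100
# pages=100 audio_min=200 generation_min=100 storage_mb=500
```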
Step 4: Generate Audio
mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav
Step 5: Concatenate
speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav
PDF Troubleshooting
| Issue | Solution |
|---|---|
| Empty/garbled text | Scanned PDF — use OCR: brew install tesseract |
| Wrong encoding | Try: pdftotext -enc UTF-8 doc.pdf |
| Check word count | pdftotext doc.pdf - \| wc -w (should be >100) |
Multi-Voice Content
mkdir -p podcast/scripts podcast/wav
echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt
speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav
speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
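With more than a few segments, the per-segment commands can be derived from the file names. The sketch below assumes a naming convention that speak does not require: scripts are named `NN_speaker.txt` and each speaker has a sample at `~/.chatter/voices/speaker.wav`. `segment_command` is a hypothetical helper that prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the speak command for one script segment.
# Usage: segment_command SCRIPT_FILE OUTPUT_DIR
segment_command() {
    script=$1 outdir=$2
    base=$(basename "$script" .txt)   # e.g. 01_host
    index=${base%%_*}                 # 01
    speaker=${base#*_}                # host
    printf 'speak %s --voice %s/.chatter/voices/%s.wav --output %s/%s.wav\n' \
        "$script" "$HOME" "$speaker" "$outdir" "$index"
}
```

A loop such as `for f in podcast/scripts/*.txt; do segment_command "$f" podcast/wav; done` then prints one command per segment, in glob order, which is another reason to zero-pad the numeric prefix.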
Options Reference
| Option | Description | Default |
|---|---|---|
| --stream | Stream as it generates | false |
| --play | Play after complete | false |
| --output <path> | Output file | ~/Audio/speak/ |
| --output-dir <dir> | Batch output directory | - |
| --voice <path> | Voice sample (full path) | default |
| --timeout <sec> | Timeout per file | 300 |
| --auto-chunk | Split long documents | false |
| --chunk-size <n> | Chars per chunk | 6000 |
| --resume <file> | Resume from manifest | - |
| --keep-chunks | Keep intermediate files | false |
| --skip-existing | Skip if output exists | false |
| --estimate | Show duration estimate | false |
| --dry-run | Preview only | false |
| --quiet | Suppress output | false |
Commands
| Command | Description |
|---|---|
| speak setup | Set up environment |
| speak health | Check system status |
| speak models | List TTS models |
| speak concat | Concatenate audio |
| speak daemon kill | Stop TTS server |
| speak config | Show configuration |
Performance
| Metric | Value |
|---|---|
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |
Resume Capability
For interrupted long generations:
# Single file with auto-chunk — use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json
# Batch processing — use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
Common Errors
| Error | Cause | Solution |
|---|---|---|
| "Voice file not found" | Relative path | Use full path: ~/.chatter/voices/x.wav |
| "Invalid WAV format" | Wrong specs | Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav |
| "Voice sample too short" | <10 seconds | Record 15-25 seconds |
| "Output directory doesn't exist" | Not created | mkdir -p dirname/ |
| "sox not found" | Not installed | brew install sox |
| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use --auto-chunk or --timeout 600 |
| "Server not running" | Stale daemon | speak daemon kill && speak health |
Setup
speak "test" # Auto-setup on first run (downloads model ~500MB)
speak setup # Or manual setup
speak health # Verify everything works
Server Management
Server auto-starts and shuts down after 1 hour idle.
speak health # Check status
speak daemon kill # Stop manually
# README.md
███████╗██████╗ ███████╗ █████╗ ██╗ ██╗
██╔════╝██╔══██╗██╔════╝██╔══██╗██║ ██╔╝
███████╗██████╔╝█████╗ ███████║█████╔╝
╚════██║██╔═══╝ ██╔══╝ ██╔══██║██╔═██╗
███████║██║ ███████╗██║ ██║██║ ██╗
╚══════╝╚═╝ ╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝
Talk to your Claude.
Voice cloning. Long documents. Audiobook quality. Local & private.
speak article.md --stream → Audio starts in seconds
Install
For AI Agents (Claude Code, Cursor, Windsurf):
npx skills add EmZod/speak
CLI:
git clone https://github.com/EmZod/speak.git
cd speak && bun install
alias speak="bun run $(pwd)/src/index.ts"
Requirements: macOS Apple Silicon · Bun · Python 3.10+ · sox (brew install sox)
Usage
speak "Hello, world!" --play # Generate and play
speak article.md --stream # Stream long content
speak document.md --output out.wav # Save to file
speak --clipboard --play # Read from clipboard
Voice Cloning
Clone any voice from a 10-30 second sample:
# Use your cloned voice
speak "Hello" --voice ~/.chatter/voices/morgan_freeman.wav --play
Long Documents
speak book.md --auto-chunk --output book.wav # Auto-chunk for reliability
speak --resume manifest.json # Resume interrupted generation
speak *.md --output-dir ~/Audio/ # Batch processing
speak --estimate document.md # Estimate duration first
Commands
speak <text|file> Generate speech
speak health Check system status
speak models List available models
speak concat <files> Combine audio files
speak daemon kill Stop TTS server
Options
--play Play after generation
--stream Stream as it generates
--output Output file or directory
--voice Custom voice file (WAV)
--auto-chunk Chunk long documents
--estimate Show duration estimate
--dry-run Preview without generating
Performance
Long documents ████████████████████ Streaming, auto-chunk
Voice cloning ████████████████████ Any voice from sample
Emotion tags ████████████████████ [laugh], [sigh], etc.
Quality ████████████████████ Audiobook grade
See Also
Need instant audio (~90ms)? Try speakturbo.
Documentation
| File | Content |
|---|---|
| SKILL.md | Full usage guide for agents |
| docs/usage.md | Complete CLI reference |
| docs/troubleshooting.md | Common issues & fixes |
| AGENTS.md | Architecture & development |
MIT License · Built on Chatterbox TTS