npx skills add EmZod/speak
Or install specific skill: npx add-skill https://github.com/EmZod/speak
# Description
Give your agent the ability to speak to you real-time. Talk to your Claude! Local TTS, text-to-speech, voice synthesis, audio generation with voice cloning on Apple Silicon. Use for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX - private, no API keys.
# SKILL.md
name: speak-tts
description: Give your agent the ability to speak to you real-time. Talk to your Claude! Local TTS, text-to-speech, voice synthesis, audio generation with voice cloning on Apple Silicon. Use for reading articles aloud, audiobook narration, or voice responses. Runs entirely on-device via MLX - private, no API keys.
speak - Talk to your Claude!
Give your agent the ability to speak to you real-time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.
Prerequisites
| Requirement | Check | Install |
|---|---|---|
| Apple Silicon Mac | uname -m → arm64 | Intel not supported |
| macOS 12.0+ | sw_vers | - |
| sox | which sox | brew install sox |
| ffmpeg | which ffmpeg | brew install ffmpeg |
| poppler (PDF) | which pdftotext | brew install poppler |
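The checks in the table can be combined into a small script. This is a minimal sketch, not part of the speak CLI: `missing_tools` and `is_apple_silicon` are hypothetical helper names introduced here for illustration.

```shell
# missing_tools NAME...: print each named command that is not on PATH,
# one per line. Prints nothing when every command is available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# is_apple_silicon: succeed only when uname -m reports arm64.
is_apple_silicon() {
  [ "$(uname -m)" = "arm64" ]
}

# Usage:
#   is_apple_silicon || echo "Intel Macs are not supported"
#   missing_tools sox ffmpeg pdftotext   # lists whatever still needs brew install
```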
Input Sources
| Source | Example |
|---|---|
| Text file | speak article.txt |
| Markdown | speak doc.md |
| Direct string | speak "Hello" |
| Clipboard | pbpaste \| speak |
| Stdin | cat file.txt \| speak |
Web Articles
lynx -dump -nolist "https://example.com/article" | speak --output article.wav
Converting Formats
| Format | Convert Command |
|---|---|
| PDF | pdftotext doc.pdf doc.txt |
| DOCX | textutil -convert txt doc.docx |
| HTML | pandoc -f html -t plain doc.html > doc.txt |
Output Modes
| Goal | Command |
|---|---|
| Save for later | speak text.txt --output file.wav |
| Listen now (streaming) | speak text.txt --stream |
| Listen now (complete) | speak text.txt --play |
| Both | speak text.txt --stream --output file.wav |
Default Behavior
speak article.txt # → ~/Audio/speak/article.wav (no playback)
speak "Hello" # → ~/Audio/speak/speak_<timestamp>.wav
Directory Auto-Creation
| Directory | Auto-Created? |
|---|---|
| ~/Audio/speak/ | ✅ Yes |
| ~/.chatter/voices/ | ❌ No |
| Custom directories | ❌ No |
Always create custom directories first:
mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
Voice Cloning
Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.
Quality Expectations
- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)
Recording Your Voice
Using QuickTime:
1. Open QuickTime Player → File → New Audio Recording
2. Record 20 seconds of clear speech
3. File → Export As → Audio Only (.m4a)
4. Convert to WAV (see below)
Using sox (command line):
# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25
Converting to Required Format
Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.
# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav
# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav
# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav
# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
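The length requirement can also be checked in a script before a sample is used. A minimal sketch, assuming ffprobe (installed with ffmpeg) is available; `duration_ok` and `sample_duration` are hypothetical helper names and are not part of speak.

```shell
# duration_ok SECONDS: succeed when the value lies in the 10-30 second
# range required for voice samples. awk handles fractional values.
duration_ok() {
  awk -v d="$1" 'BEGIN { exit !(d + 0 >= 10 && d + 0 <= 30) }'
}

# sample_duration FILE: print the duration in seconds as reported by ffprobe.
sample_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Usage (needs an existing voice.wav):
#   duration_ok "$(sample_duration voice.wav)" && echo "length OK"
```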
Using Your Voice
# Create directory
mkdir -p ~/.chatter/voices/
# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav
# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream
# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav
Path requirements:
- ✅ Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- ✅ Works: /Users/name/.chatter/voices/my_voice.wav
- ❌ Fails: my_voice.wav (relative path)
- ❌ Fails: ./voices/my_voice.wav (relative path)
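When a script only has a relative path, it can be made absolute before being passed to --voice. A minimal sketch; `abs_path` is a hypothetical helper, and it does not expand a quoted, literal `~`.

```shell
# abs_path PATH: print PATH unchanged when it is already absolute; otherwise
# prefix it with the current directory. A leading "./" is dropped first.
abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "${PWD%/}" "${1#./}" ;;
  esac
}

# Usage:
#   speak "Testing my voice" --voice "$(abs_path ./voices/my_voice.wav)" --stream
```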
Voice Sample Tips
| Good Sample | Bad Sample |
|---|---|
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |
Default Voice
When --voice is omitted, a built-in default voice is used:
speak "Hello world" --stream # Uses default voice
Emotion Tags
Tags produce audible effects (actual sounds), not spoken words:
speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."
| Tag | Effect |
|---|---|
| [laugh] | Laughter |
| [chuckle] | Light chuckle |
| [sigh] | Sighing |
| [gasp] | Gasping |
| [groan] | Groaning |
| [clear throat] | Throat clearing |
| [cough] | Coughing |
| [crying] | Crying |
| [singing] | Sung speech |
NOT supported: [pause], [whisper] (ignored)
For pauses: Use punctuation: "Wait... let me think."
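Because unsupported tags are silently ignored, it can help to scan a script for them before generating audio. A minimal sketch using grep; `unsupported_tags` is a hypothetical helper and only looks for the two tags named above.

```shell
# unsupported_tags: read text on stdin and print each [pause] or [whisper]
# tag it contains, one per line. Prints nothing when the text is clean.
unsupported_tags() {
  grep -oE '\[(pause|whisper)\]' || true
}

# Usage:
#   unsupported_tags < script.txt
```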
Batch Processing
mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav
# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk
# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing
Auto-Chunk Behavior
When using --auto-chunk with batch processing:
1. Each input file is chunked independently
2. Chunks are generated and automatically concatenated per file
3. Final output: one .wav per input file (e.g., ch01.wav)
4. Intermediate chunks deleted (unless --keep-chunks)
You don't need to concatenate chunks manually; only concatenate the final chapter files.
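For a rough idea of how many chunks a file will produce, divide its character count by the chunk size and round up. A minimal sketch, assuming the default --chunk-size of 6000 characters and a plain fixed-size split; the actual splitter may cut at different boundaries, so treat the result as a lower bound. `chunk_count` is a hypothetical helper.

```shell
# chunk_count CHARS [CHUNK_SIZE]: ceiling of CHARS / CHUNK_SIZE.
# CHUNK_SIZE defaults to 6000, the documented --chunk-size default.
chunk_count() {
  chars=$1
  size=${2:-6000}
  echo $(( (chars + size - 1) / size ))
}

# Usage:
#   chunk_count "$(wc -m < ch01.txt)"
```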
Concatenating Audio
# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav
# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav
Zero-Padding Rules
Critical for correct concatenation order:
| Files | Correct | Wrong |
|---|---|---|
| 1-9 | 01, 02, ..., 09 | 1, 2, ..., 9 |
| 10-99 | 01, 02, ..., 99 | 1, 10, 2, ... |
| 100+ | 001, 002, ..., 999 | 1, 100, 2, ... |
Why: Shell glob expansion sorts names character by character, not numerically. Unpadded names expand as 1, 10, 2, while zero-padded names expand as 01, 02, 10.
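Zero-padded names are easiest to produce with printf rather than by hand. A minimal sketch; `pad_name` is a hypothetical helper, and NUMBER must be a plain decimal without leading zeros (printf would read 08 as invalid octal).

```shell
# pad_name PREFIX NUMBER WIDTH EXT: build a zero-padded file name,
# e.g. pad_name ch 3 2 txt gives ch03.txt.
pad_name() {
  printf "%s%0${3}d.%s\n" "$1" "$2" "$4"
}

# Usage: name the output of chapter 7 in a book with fewer than 100 chapters.
#   out=$(pad_name ch 7 2 wav)
```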
PDF to Audiobook (Complete Workflow)
Step 1: Find Chapter Boundaries
# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt # Note chapter page numbers
# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"
Step 2: Extract Chapters (Zero-Padded!)
# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
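For books with many chapters, the extraction commands can be generated from a list of page ranges instead of being typed one by one. A minimal sketch; `chapter_cmds` is a hypothetical helper, the page ranges are the illustrative ones from the example above, and the PDF name is assumed to contain no spaces.

```shell
# chapter_cmds PDF RANGE...: print one pdftotext command per "first-last"
# page range, numbering the output files ch01.txt, ch02.txt, ...
chapter_cmds() {
  pdf=$1
  shift
  n=1
  for range in "$@"; do
    first=${range%-*}
    last=${range#*-}
    printf 'pdftotext -f %s -l %s -layout %s ch%02d.txt\n' \
      "$first" "$last" "$pdf" "$n"
    n=$((n + 1))
  done
}

# Usage: review the printed commands, then pipe them to sh to run them.
#   chapter_cmds textbook.pdf 1-12 13-25 26-38
#   chapter_cmds textbook.pdf 1-12 13-25 26-38 | sh
```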
Step 3: Estimate Time
speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed
# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
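The quick estimates reduce to simple arithmetic. A minimal sketch, assuming the rules of thumb stated in this guide (2 min of audio and 1 min of generation per page, about 2.5 MB per minute of audio); `estimate_pages` is a hypothetical helper and does not replace speak --estimate.

```shell
# estimate_pages PAGES: print rough audio length, generation time, and
# storage for a document of PAGES pages.
estimate_pages() {
  pages=$1
  audio_min=$((pages * 2))          # ~2 min of audio per page
  gen_min=$pages                    # ~1 min of generation per page
  size_mb=$((audio_min * 5 / 2))    # ~2.5 MB per minute, integer arithmetic
  printf '%s min audio, %s min generation, %s MB\n' \
    "$audio_min" "$gen_min" "$size_mb"
}

# Usage:
#   estimate_pages 100
```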
Step 4: Generate Audio
mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav
Step 5: Concatenate
speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav
PDF Troubleshooting
| Issue | Solution |
|---|---|
| Empty/garbled text | Scanned PDF: use OCR (brew install tesseract) |
| Wrong encoding | Try: pdftotext -enc UTF-8 doc.pdf |
| Check word count | pdftotext doc.pdf - \| wc -w (should be >100) |
Multi-Voice Content
mkdir -p podcast/scripts podcast/wav
echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt
speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav
speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
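With more than a few lines of dialogue, the voice can be derived from the script file name instead of being typed per command. A minimal sketch, assuming scripts are named NN_speaker.txt and each speaker has a sample at ~/.chatter/voices/speaker.wav; `voice_for` is a hypothetical helper.

```shell
# voice_for SCRIPT: map a script named NN_speaker.txt to that speaker's
# voice sample under $HOME/.chatter/voices/.
voice_for() {
  name=$(basename "$1" .txt)   # e.g. 01_host
  speaker=${name#*_}           # e.g. host
  printf '%s/.chatter/voices/%s.wav\n' "$HOME" "$speaker"
}

# Usage:
#   for f in podcast/scripts/*.txt; do
#     speak "$f" --voice "$(voice_for "$f")" \
#       --output "podcast/wav/$(basename "$f" .txt).wav"
#   done
```

The loop names outputs after the scripts (01_host.wav, 02_guest.wav), so they stay zero-padded and sort correctly for speak concat.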
Options Reference
| Option | Description | Default |
|---|---|---|
| --stream | Stream as it generates | false |
| --play | Play after complete | false |
| --output <path> | Output file | ~/Audio/speak/ |
| --output-dir <dir> | Batch output directory | - |
| --voice <path> | Voice sample (full path) | default |
| --timeout <sec> | Timeout per file | 300 |
| --auto-chunk | Split long documents | false |
| --chunk-size <n> | Chars per chunk | 6000 |
| --resume <file> | Resume from manifest | - |
| --keep-chunks | Keep intermediate files | false |
| --skip-existing | Skip if output exists | false |
| --estimate | Show duration estimate | false |
| --dry-run | Preview only | false |
| --quiet | Suppress output | false |
Commands
| Command | Description |
|---|---|
| speak setup | Set up environment |
| speak health | Check system status |
| speak models | List TTS models |
| speak concat | Concatenate audio |
| speak daemon kill | Stop TTS server |
| speak config | Show configuration |
Performance
| Metric | Value |
|---|---|
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |
Resume Capability
For interrupted long generations:
# Single file with auto-chunk: use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json
# Batch processing: use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
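For unattended batch runs, the re-run step can be automated with a small retry wrapper. A minimal sketch, assuming speak exits with a non-zero status when a run is interrupted or fails (not confirmed by this guide); `retry` is a hypothetical helper.

```shell
# retry MAX CMD...: run CMD until it succeeds, at most MAX times.
# Returns 0 on the first success and 1 once MAX attempts have failed.
retry() {
  max=$1
  shift
  attempt=1
  until "$@"; do
    [ "$attempt" -ge "$max" ] && return 1
    attempt=$((attempt + 1))
  done
}

# Usage: --skip-existing makes each retry pick up where the last one stopped.
#   retry 3 speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
```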
Common Errors
| Error | Cause | Solution |
|---|---|---|
| "Voice file not found" | Relative path | Use full path: ~/.chatter/voices/x.wav |
| "Invalid WAV format" | Wrong specs | Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav |
| "Voice sample too short" | <10 seconds | Record 15-25 seconds |
| "Output directory doesn't exist" | Not created | mkdir -p dirname/ |
| "sox not found" | Not installed | brew install sox |
| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use --auto-chunk or --timeout 600 |
| "Server not running" | Stale daemon | speak daemon kill && speak health |
Setup
speak "test" # Auto-setup on first run (downloads model ~500MB)
speak setup # Or manual setup
speak health # Verify everything works
Server Management
Server auto-starts and shuts down after 1 hour idle.
speak health # Check status
speak daemon kill # Stop manually
# README.md
Talk to your Claude.
Voice cloning. Long documents. Audiobook quality. Local & private.
speak article.md --stream → Audio starts in seconds
Install
For AI Agents (Claude Code, Cursor, Windsurf):
npx skills add EmZod/speak
CLI:
git clone https://github.com/EmZod/speak.git
cd speak && bun install
alias speak="bun run $(pwd)/src/index.ts"
Requirements: macOS Apple Silicon · Bun · Python 3.10+ · sox (brew install sox)
Usage
speak "Hello, world!" --play # Generate and play
speak article.md --stream # Stream long content
speak document.md --output out.wav # Save to file
speak --clipboard --play # Read from clipboard
Voice Cloning
Clone any voice from a 10-30 second sample:
# Use your cloned voice
speak "Hello" --voice ~/.chatter/voices/morgan_freeman.wav --play
Long Documents
speak book.md --auto-chunk --output book.wav # Auto-chunk for reliability
speak --resume manifest.json # Resume interrupted generation
speak *.md --output-dir ~/Audio/ # Batch processing
speak --estimate document.md # Estimate duration first
Commands
speak <text|file> Generate speech
speak health Check system status
speak models List available models
speak concat <files> Combine audio files
speak daemon kill Stop TTS server
Options
--play Play after generation
--stream Stream as it generates
--output Output file or directory
--voice Custom voice file (WAV)
--auto-chunk Chunk long documents
--estimate Show duration estimate
--dry-run Preview without generating
Performance
Long documents    Streaming, auto-chunk
Voice cloning     Any voice from sample
Emotion tags      [laugh], [sigh], etc.
Quality           Audiobook grade
See Also
Need instant audio (~90ms)? Try speakturbo.
Documentation
| File | Content |
|---|---|
| SKILL.md | Full usage guide for agents |
| docs/usage.md | Complete CLI reference |
| docs/troubleshooting.md | Common issues & fixes |
| AGENTS.md | Architecture & development |
MIT License · Built on Chatterbox TTS
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.