npx skills add michaelboeding/skills --skill "voice-generation"
Install specific skill from multi-skill repository
# SKILL.md
name: voice-generation
description: >
Use this skill for AI text-to-speech generation. Triggers include:
"generate voice", "create audio", "text to speech", "TTS", "read this aloud",
"generate narration", "create voiceover", "synthesize speech", "podcast audio",
"dialogue audio", "multi-speaker", "audiobook"
Supports Google Gemini TTS, ElevenLabs, and OpenAI TTS.
Voice Generation Skill
Generate realistic speech using AI (Google Gemini TTS, ElevenLabs, OpenAI TTS).
Prerequisites
At least one API key is required:
- GOOGLE_API_KEY - For Google Gemini TTS (same key as video/image/music)
- ELEVENLABS_API_KEY - For ElevenLabs high-quality voice synthesis
- OPENAI_API_KEY - For OpenAI TTS voices
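Before picking an API, it can help to see which of these keys is actually set. The following is a minimal sketch, not part of the skill's scripts; `KEY_TO_PROVIDER` and `available_providers` are hypothetical names introduced for illustration.

```python
import os

# Environment variable -> provider, as listed in the prerequisites.
KEY_TO_PROVIDER = {
    "GOOGLE_API_KEY": "Gemini TTS",
    "ELEVENLABS_API_KEY": "ElevenLabs",
    "OPENAI_API_KEY": "OpenAI TTS",
}


def available_providers(env=None):
    """Return the providers whose API key is set to a non-empty value."""
    env = os.environ if env is None else env
    return [
        provider
        for key, provider in KEY_TO_PROVIDER.items()
        if env.get(key, "").strip()
    ]


if __name__ == "__main__":
    found = available_providers()
    if found:
        print("Usable providers:", ", ".join(found))
    else:
        print("No API key found; set one of:", ", ".join(KEY_TO_PROVIDER))
```

Passing a plain dictionary instead of the real environment makes the check easy to exercise without touching your shell configuration.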
Available APIs
Google Gemini TTS (Recommended - Same API Key)
- Best for: Podcasts, dialogues, audiobooks with style control
- Voices: 30 voices with natural language style control
- Multi-speaker: Up to 2 speakers for dialogues
- Languages: 24 languages (auto-detected)
- Features: Control style, accent, pace via prompts
- Output: 24kHz WAV
- API Key: Same GOOGLE_API_KEY as video/image/music
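If you ever hold raw audio samples and need a 24kHz WAV file like the one described here, Python's standard `wave` module can write one. This is a minimal sketch that assumes 16-bit mono PCM input; that sample layout is an assumption, not something confirmed about the skill's scripts, and `write_wav` is a hypothetical helper name.

```python
import wave


def write_wav(path, pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container.

    Defaults assume 24 kHz, mono, 16-bit samples (sample_width is in bytes).
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
```

For example, `write_wav("silence.wav", b"\x00\x00" * 24000)` writes one second of silence under these assumptions.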
ElevenLabs (Best Quality)
- Best for: Natural-sounding voices, voice cloning, long-form content
- Voices: 100+ pre-made voices + custom voice cloning
- Languages: 29+ languages
- Models: Eleven Multilingual v2, Eleven Turbo v2
OpenAI TTS (Simplest)
- Best for: Quick, reliable text-to-speech with consistent quality
- Voices: alloy, echo, fable, onyx, nova, shimmer
- Models: tts-1 (fast), tts-1-hd (high quality)
- Output: MP3, Opus, AAC, FLAC
Workflow
Step 1: Understand the Request
Parse the user's voice request for:
- Text content: What should be spoken?
- Voice type: Male, female, specific character?
- Tone: Professional, casual, dramatic, cheerful?
- Use case: Narration, voiceover, audiobook, notification?
- Language: English, Spanish, other?
- Speed: Normal, slow, fast?
Step 2: Select Voice and API
Choose based on requirements:
| Use Case | Recommended API | Reason |
|---|---|---|
| Default / Same key as video | Gemini TTS | Same GOOGLE_API_KEY |
| Multi-speaker dialogue | Gemini TTS | Up to 2 speakers built-in |
| Style/accent control | Gemini TTS | Natural language prompts |
| Voice cloning | ElevenLabs | Only API with cloning |
| 100+ voice options | ElevenLabs | Widest selection |
| Audiobook/podcast | ElevenLabs or Gemini | Both excellent for long content |
| Quick narration | OpenAI TTS | Fast, reliable |
| Budget-conscious | OpenAI TTS | Lower cost |
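The table can be read as a small decision procedure. The sketch below encodes it under stated assumptions: `choose_api` and its parameters are hypothetical names, the order in which the rules are checked is a judgment call, and the fallback order for the default and budget cases is not taken from the skill.

```python
def choose_api(*, cloning=False, speakers=1, style_control=False, budget=False,
               available=("gemini", "elevenlabs", "openai")):
    """Pick an API name following the recommendation table.

    Returns None when no recommended API is available.
    """
    if cloning:
        ranked = ["elevenlabs"]          # only API with voice cloning
    elif speakers > 2:
        ranked = ["elevenlabs"]          # more than 2 speakers: multiple calls
    elif speakers == 2:
        ranked = ["gemini"]              # built-in two-speaker dialogue
    elif style_control:
        ranked = ["gemini"]              # natural language style prompts
    elif budget:
        ranked = ["openai", "gemini", "elevenlabs"]   # fallback order assumed
    else:
        ranked = ["gemini", "elevenlabs", "openai"]   # fallback order assumed
    for api in ranked:
        if api in available:
            return api
    return None
```

A call such as `choose_api(speakers=2)` returns `"gemini"`, while `choose_api(cloning=True, available=("openai",))` returns `None` because no available API supports cloning.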
Step 3: Prepare the Text
Optimize text for speech:
- Add pauses: Use commas, periods for natural rhythm
- Spell out numbers: "1,234" → "one thousand two hundred thirty-four" (if needed)
- Handle acronyms: "NASA" vs "N.A.S.A." depending on pronunciation
- Mark emphasis: Some APIs support emphasis markers
Example transformation:
- Original: "The Q4 2024 results show a 15% YoY increase."
- Optimized: "The Q4 2024 results show a fifteen percent year-over-year increase."
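A transformation like this one can be partly automated. The sketch below is illustrative only: it expands a small, hypothetical abbreviation table and spells out whole-number percentages below 100, leaving everything else untouched. `ABBREVIATIONS`, `small_number_to_words`, and `prepare_for_speech` are names invented for this example.

```python
import re

# Hypothetical abbreviation table; extend it for your own domain.
ABBREVIATIONS = {
    "YoY": "year-over-year",
    "QoQ": "quarter-over-quarter",
}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def small_number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def prepare_for_speech(text):
    """Expand known abbreviations and one- or two-digit percentages."""
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbreviation)}\b", spoken, text)

    def percent(match):
        return small_number_to_words(int(match.group(1))) + " percent"

    # Skip values such as "115%" or "1.5%", which this sketch does not handle.
    return re.sub(r"(?<![\d.,])(\d{1,2})%", percent, text)
```

Larger numbers, decimals, and dates are deliberately left alone here; a production pipeline would likely need a fuller number-to-words routine.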
Step 4: Generate the Audio
Execute the appropriate script from ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/:
For Google Gemini TTS (single speaker):
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
--text "Welcome to our podcast!" \
--voice "Charon"
Gemini TTS with style direction:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
--text "Have a wonderful day!" \
--voice "Puck" \
--style "Say cheerfully with a British accent:"
Gemini TTS multi-speaker (dialogue):
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py \
--multi \
--speaker "Host:Charon" \
--speaker "Guest:Aoede" \
--text "Host: Welcome to the show!
Guest: Thanks for having me!"
For ElevenLabs:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/elevenlabs.py \
--text "Your text here" \
--voice "Rachel" \
--model "eleven_multilingual_v2"
For OpenAI TTS:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/openai_tts.py \
--text "Your text here" \
--voice "nova" \
--model "tts-1-hd"
List Gemini voices:
python3 ${CLAUDE_PLUGIN_ROOT}/skills/voice-generation/scripts/gemini_tts.py --list-voices
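When the text comes from user input, building the command as an argument list avoids shell quoting problems with quotes and newlines. The sketch below assembles the Gemini commands shown above using only the flags that appear in them (`--text`, `--voice`, `--style`, `--multi`, `--speaker`); `gemini_tts_command` is a hypothetical helper, and the fallback to the current directory when `CLAUDE_PLUGIN_ROOT` is unset is an assumption.

```python
import os


def gemini_tts_command(text, voice=None, style=None, speakers=None):
    """Build an argv list for gemini_tts.py.

    speakers is an optional mapping of speaker name -> voice name; when it
    is given, the multi-speaker form of the command is produced.
    """
    root = os.environ.get("CLAUDE_PLUGIN_ROOT", ".")  # fallback is an assumption
    script = os.path.join(
        root, "skills", "voice-generation", "scripts", "gemini_tts.py"
    )
    argv = ["python3", script]
    if speakers:
        if len(speakers) > 2:
            raise ValueError("Gemini TTS supports at most 2 speakers")
        argv.append("--multi")
        for name, speaker_voice in speakers.items():
            argv += ["--speaker", f"{name}:{speaker_voice}"]
    else:
        if voice:
            argv += ["--voice", voice]
        if style:
            argv += ["--style", style]
    argv += ["--text", text]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which runs the script without involving a shell.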
Step 5: Deliver the Result
- Provide the generated audio file path
- Mention the voice and settings used
- Offer to:
- Try a different voice
- Adjust speed or tone
- Use a different API
- Generate in a different format
Error Handling
Missing API key: Inform the user which key is needed:
- Gemini TTS: Same GOOGLE_API_KEY as video/image - https://aistudio.google.com/apikey
- ElevenLabs: https://elevenlabs.io
- OpenAI: https://platform.openai.com/api-keys
Gemini TTS requires google-genai package: pip install google-genai
Text too long: Split into chunks and concatenate, or suggest shorter text.
Rate limit: Suggest waiting or trying a different API.
Unsupported language: Suggest an alternative API that supports the language.
Multi-speaker limit: Gemini TTS supports max 2 speakers. For more, use ElevenLabs with multiple calls.
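For the "text too long" case, splitting at sentence boundaries usually sounds better than cutting at a fixed offset. The following is a minimal sketch of such a splitter; `split_for_tts` is a hypothetical helper, and the `max_chars` value you pass should come from the limit of the API in use. Concatenating the resulting audio files is not shown.

```python
import re


def split_for_tts(text, max_chars):
    """Split text into chunks of at most max_chars characters.

    Sentences are kept whole where possible; a single sentence longer than
    max_chars is cut at the last space that fits, or mid-word as a last resort.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            piece, sentence = sentence[:cut], sentence[cut:].lstrip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own request, using the same voice and settings so the joined audio stays consistent.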
Voice Selection Guide
Google Gemini TTS Voices (30 voices)
| Style | Voices | Best For |
|---|---|---|
| Bright/Upbeat | Zephyr, Puck, Aoede, Laomedeia | Marketing, cheerful content |
| Firm/Informative | Charon, Kore, Orus, Rasalgethi | News, tutorials, professional |
| Soft/Warm | Achernar, Sulafat, Vindemiatrix | Meditation, gentle narration |
| Smooth | Algieba, Despina, Callirrhoe | Audiobooks, storytelling |
| Clear | Erinome, Iapetus, Pulcherrima | Instructions, clarity |
| Character | Fenrir (excitable), Enceladus (breathy), Algenib (gravelly), Gacrux (mature) | Character voices, drama |
| Friendly | Achird, Zubenelgenubi (casual) | Casual, conversational |
Gemini TTS Style Tips:
- Use natural language: --style "Say angrily:" or --style "Whisper mysteriously:"
- Specify accents: --style "Speak with a British accent from London:"
- Control pace: --style "Speak slowly and deliberately:"
- Combine: --style "Say excitedly with a Southern US accent:"
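Style directives like these can be composed from separate parts. The sketch below is one possible way to do it; `build_style` and its parameters are hypothetical, and it always starts the directive with "Say", which is a simplification of the varied phrasings shown in the tips.

```python
def build_style(emotion=None, pace=None, accent=None):
    """Compose a --style directive, or return None if nothing was requested."""
    parts = ["Say"]
    if emotion:
        parts.append(emotion)
    if pace:
        parts.append(pace)
    if accent:
        article = "an" if accent[0].lower() in "aeiou" else "a"
        parts.append(f"with {article} {accent} accent")
    if len(parts) == 1:
        return None
    return " ".join(parts) + ":"
```

For instance, `build_style(emotion="excitedly", accent="Southern US")` produces `"Say excitedly with a Southern US accent:"`, matching the combined example in the tips.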
OpenAI TTS Voices
| Voice | Description | Best For |
|---|---|---|
| alloy | Neutral, balanced | General purpose |
| echo | Warm, conversational | Podcasts, casual |
| fable | Expressive, British | Storytelling |
| onyx | Deep, authoritative | Narration, professional |
| nova | Friendly, upbeat | Marketing, tutorials |
| shimmer | Soft, gentle | Meditation, ASMR |
ElevenLabs Popular Voices
| Voice | Description | Best For |
|---|---|---|
| Rachel | Young female, American | Narration, audiobooks |
| Domi | Young female, energetic | Marketing, ads |
| Bella | Young female, soft | Storytelling |
| Antoni | Young male, well-rounded | Narration |
| Josh | Young male, deep | Audiobooks |
| Arnold | Mature male, authoritative | Documentary |
| Adam | Middle-aged male, deep | Narration |
| Sam | Young male, raspy | Character voices |
Best Practices
For Narration
- Use a consistent voice throughout
- Add natural pauses between paragraphs
- Consider pacing for the content type
For Dialogue
- Use different voices for different characters
- Match voice characteristics to character descriptions
- Adjust speed for emotional scenes
For Accessibility
- Use clear, well-paced speech
- Avoid overly stylized voices
- Test with screen readers if applicable
API Comparison
| Feature | Gemini TTS | ElevenLabs | OpenAI TTS |
|---|---|---|---|
| API Key | GOOGLE_API_KEY | ELEVENLABS_API_KEY | OPENAI_API_KEY |
| Voice quality | Excellent | Excellent | Very good |
| Voice variety | 30 voices | 100+ voices | 6 voices |
| Multi-speaker | Yes, up to 2 | No | No |
| Style control | Yes, natural language | Limited | No |
| Voice cloning | No | Yes | No |
| Languages | 24 | 29+ | 50+ |
| Speed control | Via prompts | Yes | Yes (0.25-4x) |
| Max length | 32k tokens | 5,000 chars | 4,096 chars |
| Output format | WAV (24kHz) | MP3, WAV | MP3, Opus, AAC, FLAC |
| Same key as video/image | Yes | No | No |