npx skills add akrindev/google-studio-skills --skill "gemini-tts"
Install specific skill from multi-skill repository
# Description
Generate speech from text using Google Gemini TTS models via scripts/. Use for text-to-speech, audio generation, voice synthesis, multi-speaker conversations, and creating audio content. Supports multiple voices and streaming. Triggers on "text to speech", "TTS", "generate audio", "voice synthesis", "speak this text".
# SKILL.md
name: gemini-tts
description: Generate speech from text using Google Gemini TTS models via scripts/. Use for text-to-speech, audio generation, voice synthesis, multi-speaker conversations, and creating audio content. Supports multiple voices and streaming. Triggers on "text to speech", "TTS", "generate audio", "voice synthesis", "speak this text".
license: MIT
version: 1.0.0
keywords: text-to-speech, TTS, audio generation, voice synthesis, multi-speaker, streaming, Kore, Puck, Charon, Fenrir, Aoede, Zephyr
Gemini Text-to-Speech
Generate natural-sounding speech from text using Gemini's TTS models through executable scripts with support for multiple voices and multi-speaker conversations.
When to Use This Skill
Use this skill when you need to:
- Convert text to natural speech
- Create audio for podcasts, audiobooks, or videos
- Generate multi-speaker conversations
- Stream audio for long content
- Choose from multiple voice options
- Create accessible audio content
- Generate voiceovers for presentations
- Batch convert text to audio files
Available Scripts
scripts/tts.py
Purpose: Convert text to speech using Gemini TTS models
When to use:
- Any text-to-speech conversion
- Multi-speaker conversation generation
- Streaming audio for long texts
- Voiceovers for content creation
- Accessible audio generation
Key parameters:
| Parameter | Description | Example |
|-----------|-------------|---------|
| text | Text to convert (required) | "Hello, world!" |
| --voice, -v | Voice name | Kore |
| --output, -o | Base name for output file | welcome |
| --output-dir | Output directory for audio | audio/ |
| --no-timestamp | Disable auto timestamp | Flag |
| --model, -m | TTS model | gemini-2.5-flash-preview-tts |
| --stream, -s | Enable streaming | Flag |
| --speakers | Multi-speaker mapping | "Joe:Kore,Jane:Puck" |
Output: WAV audio file path
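The parameter table maps naturally onto an `argparse` definition. The following is a minimal sketch of how such a command-line interface could be declared. It is not the actual source of `scripts/tts.py`: the function name `build_parser` is hypothetical, and the defaults (`Kore`, `audio/`, `tts_output`, the flash model) are inferred from the tables and workflows in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI mirroring the parameter table; the real script may differ."""
    parser = argparse.ArgumentParser(description="Convert text to speech (sketch)")
    parser.add_argument("text", help="Text to convert (required)")
    parser.add_argument("--voice", "-v", default="Kore", help="Voice name")
    parser.add_argument("--output", "-o", default="tts_output",
                        help="Base name for output file")
    parser.add_argument("--output-dir", default="audio/",
                        help="Output directory for audio")
    parser.add_argument("--no-timestamp", action="store_true",
                        help="Disable auto timestamp")
    parser.add_argument("--model", "-m", default="gemini-2.5-flash-preview-tts",
                        help="TTS model")
    parser.add_argument("--stream", "-s", action="store_true",
                        help="Enable streaming")
    parser.add_argument("--speakers", default=None,
                        help='Multi-speaker mapping, e.g. "Joe:Kore,Jane:Puck"')
    return parser


# Parse an example command line without touching sys.argv.
args = build_parser().parse_args(
    ["Hello, world!", "-v", "Puck", "--output", "welcome", "--stream"]
)
print(args.voice, args.output, args.stream, args.no_timestamp)
```

Flags such as `--stream` and `--no-timestamp` take no value, which is why the table lists their example as "Flag".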
Workflows
Workflow 1: Basic Text-to-Speech
python scripts/tts.py "Hello, world! Have a wonderful day."
- Best for: Quick audio generation, simple messages
- Voice: `Kore` (default, clear and professional)
- Output: `audio/tts_output_YYYYMMDD_HHMMSS.wav` (auto timestamp)
Workflow 2: Choose Different Voice
python scripts/tts.py "Welcome to our podcast about technology trends" --voice Puck --output welcome
- Best for: Friendly, conversational content
- Voice options: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Output: `audio/welcome_YYYYMMDD_HHMMSS.wav`
Workflow 3: Multi-Speaker Conversation
python scripts/tts.py "TTS the following conversation:
Joe: How's it going today?
Jane: Not too bad, how about you?
Joe: I'm working on a new project.
Jane: Sounds exciting, tell me more!" --speakers "Joe:Kore,Jane:Puck" --output conversation
- Best for: Dialogues, interviews, role-playing content
- Format: Conversation text in which each line starts with the speaker's name and a colon
- Script automatically routes text to appropriate voices
- Output: `audio/conversation_YYYYMMDD_HHMMSS.wav`
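The `--speakers` value is a comma-separated list of `Speaker:Voice` pairs. As an illustration of that format only, here is one way such a string could be parsed and checked. The helper name `parse_speakers` is hypothetical, and the real script may parse the mapping differently.

```python
def parse_speakers(mapping: str) -> dict[str, str]:
    """Parse "Joe:Kore,Jane:Puck" into {"Joe": "Kore", "Jane": "Puck"}."""
    speakers: dict[str, str] = {}
    for pair in mapping.split(","):
        name, sep, voice = pair.partition(":")
        name, voice = name.strip(), voice.strip()
        # Every pair needs a colon with a non-empty name and voice around it.
        if not sep or not name or not voice:
            raise ValueError(f"Expected 'Speaker:Voice', got {pair!r}")
        if name in speakers:
            raise ValueError(f"Speaker {name!r} is mapped more than once")
        speakers[name] = voice
    return speakers


print(parse_speakers("Joe:Kore,Jane:Puck"))
```

The speaker names in the mapping should match the labels used in the conversation text (`Joe:` and `Jane:` in the workflow above).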
Workflow 4: Long Content with Streaming
python scripts/tts.py "This is a very long text that would benefit from streaming..." --stream --output long-form
- Best for: Podcasts, audiobooks, long articles
- Streaming: Processes audio in chunks for long texts
- Output: `audio/long-form_YYYYMMDD_HHMMSS.wav`
Workflow 5: Professional Voiceover
python scripts/tts.py "Welcome to our quarterly earnings presentation. Today we'll discuss our growth metrics and future plans." --voice Charon --output voiceover
- Best for: Corporate content, presentations, formal announcements
- Voice: `Charon` (deep, authoritative)
- Use when: Professional, serious tone required
Workflow 6: Custom Output Directory
python scripts/tts.py "Save to specific folder." --output-dir ./my-projects/podcasts/ --output episode1
- Best for: Organized project structures
- Directory created automatically if it doesn't exist
- Output: `./my-projects/podcasts/episode1_YYYYMMDD_HHMMSS.wav`
Workflow 7: Content Creation Pipeline (Text → Audio)
# 1. Generate script (gemini-text skill)
python skills/gemini-text/scripts/generate.py "Write a 2-minute podcast intro about sustainable energy"
# 2. Generate audio (this skill)
python scripts/tts.py "[Paste generated script]" --voice Fenrir --output podcast-intro
# 3. Use in video or podcast
- Best for: Podcasts, audiobooks, video narration
- Combines with: gemini-text for script generation
Workflow 8: Accessible Content
python scripts/tts.py "Welcome to our accessible website. This audio describes our main navigation options." --voice Aoede --output accessibility
- Best for: Web accessibility, screen reader alternatives
- Voice: `Aoede` (melodic, pleasant)
- Use when: Making content accessible to visually impaired users
Workflow 9: Educational Content
python scripts/tts.py "Chapter 1: Introduction to Quantum Computing. Let's explore the fundamental principles..." --voice Zephyr --output chapter1
- Best for: Educational materials, tutorials, e-learning
- Voice: `Zephyr` (light, airy)
- Combines well with: gemini-text for content generation
Workflow 10: Disable Timestamp
python scripts/tts.py "Fixed filename." --output my-audio --no-timestamp
- Best for: When you want complete control over filename
- Output: `audio/my-audio.wav` (no timestamp)
- Use when: Generating files for specific naming schemes
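The output names shown across the workflows follow one pattern: output directory, base name, an optional `_YYYYMMDD_HHMMSS` suffix, and a `.wav` extension. A minimal sketch of that naming rule follows. The function `build_output_path` and its parameters are hypothetical, inferred from the example outputs rather than taken from the script.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def build_output_path(base: str = "tts_output", output_dir: str = "audio",
                      timestamp: bool = True,
                      now: Optional[datetime] = None) -> Path:
    """Compose <output_dir>/<base>[_YYYYMMDD_HHMMSS].wav."""
    stem = base
    if timestamp:
        now = now or datetime.now()
        stem = f"{base}_{now:%Y%m%d_%H%M%S}"
    return Path(output_dir) / f"{stem}.wav"


fixed = datetime(2025, 1, 31, 9, 5, 7)  # fixed time so the example is repeatable
print(build_output_path("welcome", now=fixed).as_posix())
print(build_output_path("my-audio", timestamp=False).as_posix())
```

With the timestamp enabled, repeated runs produce distinct files; with it disabled, the same base name always maps to the same path.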
Parameters Reference
Model Selection
| Model | Quality | Speed | Best For |
|---|---|---|---|
| `gemini-2.5-flash-preview-tts` | Good | Fast | General use, high volume |
| `gemini-2.5-pro-preview-tts` | Higher | Slower | Premium content, voiceovers |
Voice Selection
| Voice | Characteristics | Best For |
|---|---|---|
| Kore | Clear, professional | Announcements, general purpose (default) |
| Puck | Friendly, conversational | Casual content, interviews |
| Charon | Deep, authoritative | Corporate, serious content |
| Fenrir | Warm, expressive | Storytelling, narratives |
| Aoede | Melodic, pleasant | Educational, accessibility |
| Zephyr | Light, airy | Gentle content, tutorials |
| Sulafat | Neutral, balanced | Documentaries, factual content |
Audio Format
| Specification | Value |
|---|---|
| Format | WAV (PCM) |
| Sample rate | 24000 Hz |
| Channels | 1 (mono) |
| Bit depth | 16-bit |
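To make the format table concrete, the sketch below wraps raw PCM bytes in a WAV container with Python's standard `wave` module, using the values from the table (24000 Hz, mono, 16-bit). The helper names are hypothetical, and one second of silence stands in for real model output.

```python
import io
import wave

SAMPLE_RATE = 24000   # Hz
CHANNELS = 1          # mono
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)


def write_wav(target, pcm: bytes) -> None:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    with wave.open(target, "wb") as wf:
        wf.setnchannels(CHANNELS)
        wf.setsampwidth(SAMPLE_WIDTH)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(pcm)


def duration_seconds(source) -> float:
    """Read a WAV file or file-like object and return its length in seconds."""
    with wave.open(source, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# One second of silence: SAMPLE_RATE frames of two zero bytes each.
buffer = io.BytesIO()
write_wav(buffer, b"\x00\x00" * SAMPLE_RATE)
buffer.seek(0)
print(duration_seconds(buffer))
```

At these settings one second of audio occupies 48,000 bytes of PCM data, which is useful for estimating file sizes.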
Token Limits
| Limit | Type | Description |
|---|---|---|
| 8,192 | Input | Maximum input text tokens |
| 16,384 | Output | Maximum output audio tokens |
Output Interpretation
Audio File
- Format: WAV (compatible with most players)
- Mono channel (single audio track)
- Sample rate: 24000 Hz (well suited to speech)
- Can be converted to MP3/AAC if needed
Multi-Speaker Files
- Single WAV file with multiple voices
- Voices alternate over time within the single mono track rather than occupying separate channels
- Use the `--speakers` parameter to map speakers to voices
Streaming Output
- Audio processed in chunks during generation
- Script shows "Streaming audio..." message
- Useful for very long texts or real-time applications
Common Issues
"google-genai not installed"
pip install google-genai
"Voice name not found"
- Check voice name spelling
- Use available voices: Kore, Puck, Charon, Fenrir, Aoede, Zephyr, Sulafat
- Voice names are case-sensitive
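Because voice names are case-sensitive, a small pre-flight check can catch typos before a request is sent. This is an illustrative sketch only: `check_voice` is a hypothetical helper, and the `VOICES` tuple contains just the voices named in this document, so the API may accept others.

```python
import difflib

# Voices listed in this document; the API may offer more.
VOICES = ("Kore", "Puck", "Charon", "Fenrir", "Aoede", "Zephyr", "Sulafat")


def check_voice(name: str) -> str:
    """Return name if it is a listed voice, else raise with a suggestion."""
    if name in VOICES:  # exact, case-sensitive match
        return name
    # First look for a case-only mismatch, then fall back to fuzzy matching.
    by_lower = {v.lower(): v for v in VOICES}
    guess = by_lower.get(name.lower())
    if guess is None:
        close = difflib.get_close_matches(name, VOICES, n=1)
        guess = close[0] if close else None
    hint = f" Did you mean {guess!r}?" if guess else ""
    raise ValueError(f"Voice name not found: {name!r}.{hint}")


print(check_voice("Kore"))
```

Calling `check_voice("kore")` raises a `ValueError` whose message suggests `'Kore'`.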
"No audio generated"
- Check text is not empty
- Verify text doesn't exceed token limit (8,192)
- Try shorter text segments
- Check API quota limits
"Multi-speaker format error"
- Format: `SpeakerName:VoiceName,Speaker2:Voice2`
- Separate speakers with commas
- Use colon between speaker and voice
- Example: `"Joe:Kore,Jane:Puck,Host:Charon"`
"Output file already exists"
- Script will overwrite existing files
- Change the `--output` filename to avoid conflicts
- Use unique names for batch generation
Audio quality issues
- Check input text for unusual characters
- Try different voice for better pronunciation
- Consider splitting long text into smaller segments
- Verify audio playback software compatibility
Best Practices
Voice Selection
- Kore: General purpose, clear articulation
- Puck: Conversational, engaging tone
- Charon: Professional, authoritative
- Fenrir: Emotional, storytelling
- Aoede: Soft, gentle for accessibility
- Zephyr: Educational, clear explanations
- Sulafat: Neutral, balanced for factual content
Text Preparation
- Use natural language and punctuation
- Include pauses with commas and periods
- Spell out difficult words if needed
- Break very long text into logical segments
- Add speaker labels for multi-speaker content
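Breaking long text into logical segments can be automated by grouping whole sentences under a size budget. The sketch below uses a character budget as a rough stand-in for the token limit, since exact token counts depend on the model's tokenizer. The function name `split_text` and the default of 2000 characters are assumptions, not values from the script.

```python
import re


def split_text(text: str, max_chars: int = 2000) -> list[str]:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own segment.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two! Three? Four.", max_chars=10))
```

Each segment can then be passed to `scripts/tts.py` as a separate request, keeping sentence boundaries (and therefore natural pauses) intact.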
Performance Optimization
- Use streaming for very long texts
- Generate shorter segments for better control
- Use flash model for faster generation
- Batch process multiple files for efficiency
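Batch processing can be as simple as building one command per text and running them in a loop. The sketch below only assembles and prints the commands (a dry run) using flags from the parameter table. The `jobs` dictionary and the `build_command` helper are hypothetical.

```python
import shlex
import sys


def build_command(text: str, output: str, voice: str = "Kore",
                  script: str = "scripts/tts.py") -> list[str]:
    """Build one argv list using only flags documented in the parameter table."""
    return [sys.executable, script, text, "--voice", voice, "--output", output]


# Map each output base name to the text it should contain.
jobs = {
    "intro": "Welcome to the show.",
    "outro": "Thanks for listening.",
}
commands = [build_command(text, name, voice="Puck") for name, text in jobs.items()]
for cmd in commands:
    # Dry run: print the command. To execute it, use subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems when the text itself contains quotes or apostrophes.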
Quality Tips
- Test different voices for your content type
- Use appropriate pacing with punctuation
- Consider context when selecting voice
- Listen to output before final use
- Multi-speaker requires clear speaker labeling
Use Cases by Voice
| Voice | Ideal Use Cases |
|---|---|
| Kore | Announcements, navigation, general info |
| Puck | Podcasts, interviews, casual content |
| Charon | Corporate, news, formal presentations |
| Fenrir | Audiobooks, stories, emotional content |
| Aoede | Accessibility, educational, gentle content |
| Zephyr | Tutorials, explanations, guides |
| Sulafat | Documentaries, factual presentations |
Related Skills
- gemini-text: Generate scripts and text for TTS
- gemini-image: Create visuals to accompany audio
- gemini-batch: Process multiple TTS requests efficiently
- gemini-files: Upload audio files for processing
Quick Reference
# Basic
python scripts/tts.py "Your text here"
# Custom voice
python scripts/tts.py "Your text" --voice Puck --output audio
# Multi-speaker
python scripts/tts.py "Joe: Hi. Jane: Hello!" --speakers "Joe:Kore,Jane:Puck"
# Streaming
python scripts/tts.py "Long text..." --stream --output long
# Professional
python scripts/tts.py "Corporate announcement" --voice Charon
Reference
- See `references/voices.md` for complete voice documentation
- Get API key: https://aistudio.google.com/apikey
- Documentation: https://ai.google.dev/gemini-api/docs/text-to-speech
- Sample rate: 24000 Hz for all generated audio