Install this skill from the multi-skill repository:

```bash
npx skills add qodex-ai/ai-agent-skills --skill "voice-ai-integration"
```
# Description
Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.
# SKILL.md
---
name: voice-ai-integration
description: Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.
---
# Voice AI Integration
Build intelligent voice-enabled AI applications that understand spoken language and respond naturally through audio, creating seamless voice-first user experiences.
## Overview
Voice AI systems combine three key capabilities:
1. Speech Recognition - Convert audio input to text
2. Natural Language Processing - Understand intent and context
3. Text-to-Speech - Generate natural-sounding responses
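A single conversational turn chains these stages together. Below is a minimal sketch of that composition; `stt`, `nlp`, and `tts` are placeholder callables standing in for whichever providers you wire up:

```python
async def voice_turn(audio_in: bytes, stt, nlp, tts) -> bytes:
    """One conversational turn: audio in, audio out."""
    text = await stt(audio_in)   # 1. Speech Recognition
    reply = await nlp(text)      # 2. Natural Language Processing
    return await tts(reply)      # 3. Text-to-Speech
```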
## Speech Recognition Providers

See `examples/speech_recognition_providers.py` for implementations:
- Google Cloud Speech-to-Text: High accuracy with automatic punctuation
- OpenAI Whisper: Robust multilingual speech recognition
- Azure Speech Services: Enterprise-grade speech recognition
- AssemblyAI: Async processing with high accuracy
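As a concrete illustration, here is one way to call OpenAI's hosted Whisper endpoint with the official `openai` SDK (the bundled examples file may structure this differently):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```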
## Text-to-Speech Providers

See `examples/text_to_speech_providers.py` for implementations:
- Google Cloud TTS: Natural voices with multiple language support
- OpenAI TTS: Simple integration with high-quality output
- Azure Speech Services: Enterprise TTS with neural voices
- ElevenLabs: Premium voices with emotional control
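For comparison, a minimal OpenAI TTS call looks like the sketch below (again, an illustration rather than the bundled implementation):

```python
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Your meeting starts in five minutes.",
)

with open("reply.mp3", "wb") as f:
    f.write(response.content)  # raw MP3 bytes
```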
## Voice Assistant Architecture

See `examples/voice_assistant.py` for the `VoiceAssistant` class:
- Complete voice pipeline: STT → NLP → TTS
- Conversation history management
- Multi-provider support (OpenAI, Google, Azure, etc.)
- Async processing for responsive interactions
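A hypothetical single-turn usage, assuming the method names that also appear in the smart-home example below:

```python
import asyncio

async def main():
    assistant = VoiceAssistant()  # from examples/voice_assistant.py

    with open("question.wav", "rb") as f:
        text = await assistant.process_voice_input(f.read())  # STT

    audio_out = await assistant.synthesize_response(text)     # TTS

    with open("answer.wav", "wb") as f:
        f.write(audio_out)

asyncio.run(main())
```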
## Real-Time Voice Processing

See `examples/realtime_voice_processor.py` for the `RealTimeVoiceProcessor` class:
- Stream audio input from microphone
- Stream audio output to speakers
- Voice Activity Detection (VAD)
- Configurable sample rates and chunk sizes
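VAD can run locally before any audio leaves the machine. A small sketch using the `webrtcvad` package (an assumption; the examples file may use a different VAD):

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)

def speech_frames(pcm: bytes):
    """Yield only the fixed-size frames that contain speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```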
## Voice Agent Applications

### Voice-Controlled Smart Home
```python
class SmartHomeVoiceAgent:
    def __init__(self):
        self.voice_assistant = VoiceAssistant()
        self.devices = {
            "lights": SmartLights(),
            "temperature": SmartThermostat(),
            "security": SecuritySystem(),
        }

    async def handle_voice_command(self, audio_input):
        # Get text from voice
        command_text = await self.voice_assistant.process_voice_input(audio_input)

        # Parse intent
        intent = parse_smart_home_intent(command_text)

        # Execute command
        if intent.action == "turn_on_lights":
            self.devices["lights"].turn_on(intent.room)
        elif intent.action == "set_temperature":
            self.devices["temperature"].set(intent.value)

        # Confirm with voice
        response = f"I've {intent.action_description}"
        audio_output = await self.voice_assistant.synthesize_response(response)
        return audio_output
```
### Voice Meeting Transcription
```python
from datetime import datetime

class VoiceMeetingRecorder:
    def __init__(self):
        self.processor = RealTimeVoiceProcessor()
        self.transcripts = []

    async def record_and_transcribe_meeting(self, duration_seconds=3600):
        audio_stream = self.processor.stream_audio_input()
        buffer = []
        chunk_duration = 30   # Transcribe every 30 seconds
        sample_rate = 16000   # Must match the processor's configured rate
        total_samples = 0

        for audio_chunk in audio_stream:
            buffer.append(audio_chunk)
            total_samples += len(audio_chunk)

            # Flush the buffer once it holds ~30 seconds of audio
            if sum(len(chunk) for chunk in buffer) >= chunk_duration * sample_rate:
                transcript = transcribe_audio_whisper(buffer)
                self.transcripts.append({
                    "timestamp": datetime.now(),
                    "text": transcript,
                })
                buffer = []

            # Stop once the requested meeting length has been captured
            if total_samples >= duration_seconds * sample_rate:
                break

        return self.transcripts
```
## Best Practices

### Audio Quality
- ✓ Use 16kHz sample rate for speech recognition
- ✓ Handle background noise filtering
- ✓ Implement voice activity detection (VAD)
- ✓ Normalize audio levels
- ✓ Use appropriate audio format (WAV for quality)
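For instance, peak normalization takes a few lines with NumPy (a sketch assuming float32 samples in [-1, 1]):

```python
import numpy as np

def normalize_peak(samples: np.ndarray, target: float = 0.9) -> np.ndarray:
    """Scale samples so the loudest peak sits at `target` of full scale."""
    peak = np.abs(samples).max()
    if peak == 0:
        return samples  # silence: nothing to scale
    return samples * (target / peak)
```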
### Latency Optimization
- ✓ Use low-latency STT models
- ✓ Implement streaming transcription
- ✓ Cache common responses
- ✓ Use async processing
- ✓ Minimize network round trips
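Caching synthesized audio for repeated phrases is a one-class job; a minimal sketch where `synthesize` is a placeholder for your TTS client:

```python
import hashlib

class TTSCache:
    """Reuse synthesized audio for frequently spoken phrases."""

    def __init__(self):
        self._store: dict[str, bytes] = {}

    def get_or_synthesize(self, text: str, synthesize) -> bytes:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = synthesize(text)  # hit the TTS API only on a miss
        return self._store[key]
```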
### Error Handling
- ✓ Handle network failures gracefully
- ✓ Implement fallback voices/providers
- ✓ Log audio processing failures
- ✓ Validate audio quality before processing
- ✓ Implement retry logic
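A retry-with-fallback wrapper might look like the sketch below, where `providers` is a hypothetical ordered list of callables, each wrapping one STT service:

```python
import time

def transcribe_with_fallback(audio: bytes, providers, retries: int = 2) -> str:
    """Try each provider in order, retrying transient failures with backoff."""
    last_error = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(audio)
            except Exception as exc:  # narrow to network/provider errors in practice
                last_error = exc
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
    raise RuntimeError("All speech providers failed") from last_error
```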
### Privacy & Security
- ✓ Encrypt audio in transit
- ✓ Delete audio after processing
- ✓ Implement user consent mechanisms
- ✓ Log access to audio data
- ✓ Comply with data regulations (GDPR, CCPA)
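One way to guarantee deletion is a `try/finally` around a temporary file (a sketch; `transcribe` is a placeholder for your STT call):

```python
import os
import tempfile

def transcribe_and_discard(audio_bytes: bytes, transcribe) -> str:
    """Ensure raw audio is removed even if transcription fails."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        return transcribe(path)
    finally:
        os.remove(path)  # the audio never outlives the request
```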
## Common Challenges & Solutions

### Challenge: Accents and Dialects
Solutions:
- Use multilingual models
- Fine-tune on regional data
- Implement language detection
- Use domain-specific vocabularies
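Language detection is built into the open-source Whisper package; this follows the openai-whisper README:

```python
import whisper

model = whisper.load_model("base")

# Whisper detects language on a 30-second mel spectrogram window
audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```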
### Challenge: Background Noise
Solutions:
- Implement noise filtering
- Use beamforming techniques
- Pre-process audio with noise removal
- Deploy microphone arrays
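As one example of pre-processing, the `noisereduce` package applies spectral gating in a couple of lines (assumes a mono WAV input):

```python
import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("noisy_input.wav")
cleaned = nr.reduce_noise(y=data, sr=rate)  # spectral-gating noise reduction
wavfile.write("clean_input.wav", rate, cleaned)
```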
### Challenge: Long Audio Files
Solutions:
- Implement chunked processing
- Use streaming APIs
- Split into speaker turns
- Implement caching
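A chunked-processing sketch using `pydub`, where `transcribe` is a placeholder for your STT call (production code should also overlap chunks so words are not cut mid-syllable):

```python
from pydub import AudioSegment

def transcribe_long_file(path: str, transcribe, chunk_minutes: int = 5) -> str:
    """Split a long recording into chunks and stitch the transcripts together."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub slices by milliseconds
    parts = []
    for start in range(0, len(audio), chunk_ms):
        chunk = audio[start:start + chunk_ms]
        chunk.export("chunk.wav", format="wav")
        parts.append(transcribe("chunk.wav"))
    return " ".join(parts)
```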
## Frameworks & Libraries

### Speech Recognition
- OpenAI Whisper
- Google Cloud Speech-to-Text
- Azure Speech Services
- AssemblyAI
- DeepSpeech
### Text-to-Speech
- Google Cloud Text-to-Speech
- OpenAI TTS
- Azure Text-to-Speech
- ElevenLabs
- Tacotron 2
## Getting Started
1. Choose STT and TTS providers
2. Set up authentication
3. Build a basic voice pipeline
4. Add conversation management
5. Implement error handling
6. Test with real users
7. Monitor and optimize latency
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.