Install this skill from the multi-skill repository:

```bash
npx skills add qodex-ai/ai-agent-skills --skill "voice-ai-integration"
```
# Description
Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.
# SKILL.md
---
name: voice-ai-integration
description: Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.
---
# Voice AI Integration
Build intelligent voice-enabled AI applications that understand spoken language and respond naturally through audio, creating seamless voice-first user experiences.
## Overview
Voice AI systems combine three key capabilities:
1. Speech Recognition - Convert audio input to text
2. Natural Language Processing - Understand intent and context
3. Text-to-Speech - Generate natural-sounding responses
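A single conversational turn chains these stages together. Below is a minimal sketch of that composition; `stt`, `nlp`, and `tts` are placeholder callables standing in for whichever providers you wire up:

```python
async def voice_turn(audio_in: bytes, stt, nlp, tts) -> bytes:
    """One conversational turn: audio in, audio out."""
    text = await stt(audio_in)   # 1. Speech Recognition
    reply = await nlp(text)      # 2. Natural Language Processing
    return await tts(reply)      # 3. Text-to-Speech
```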
## Speech Recognition Providers

See `examples/speech_recognition_providers.py` for implementations:
- Google Cloud Speech-to-Text: High accuracy with automatic punctuation
- OpenAI Whisper: Robust multilingual speech recognition
- Azure Speech Services: Enterprise-grade speech recognition
- AssemblyAI: Async processing with high accuracy
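As a concrete illustration, here is one way to call OpenAI's hosted Whisper endpoint with the official `openai` SDK (the bundled examples file may structure this differently):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```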
## Text-to-Speech Providers

See `examples/text_to_speech_providers.py` for implementations:
- Google Cloud TTS: Natural voices with multiple language support
- OpenAI TTS: Simple integration with high-quality output
- Azure Speech Services: Enterprise TTS with neural voices
- ElevenLabs: Premium voices with emotional control
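For comparison, a minimal OpenAI TTS call looks like the sketch below (again, an illustration rather than the bundled implementation):

```python
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Your meeting starts in five minutes.",
)

with open("reply.mp3", "wb") as f:
    f.write(response.content)  # raw MP3 bytes
```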
## Voice Assistant Architecture

See `examples/voice_assistant.py` for the `VoiceAssistant` class:
- Complete voice pipeline: STT → NLP → TTS
- Conversation history management
- Multi-provider support (OpenAI, Google, Azure, etc.)
- Async processing for responsive interactions
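A hypothetical single-turn usage, assuming the method names that also appear in the smart-home example below:

```python
import asyncio

async def main():
    assistant = VoiceAssistant()  # from examples/voice_assistant.py

    with open("question.wav", "rb") as f:
        text = await assistant.process_voice_input(f.read())  # STT

    audio_out = await assistant.synthesize_response(text)     # TTS

    with open("answer.wav", "wb") as f:
        f.write(audio_out)

asyncio.run(main())
```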
## Real-Time Voice Processing

See `examples/realtime_voice_processor.py` for the `RealTimeVoiceProcessor` class:
- Stream audio input from microphone
- Stream audio output to speakers
- Voice Activity Detection (VAD)
- Configurable sample rates and chunk sizes
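VAD can run locally before any audio leaves the machine. A small sketch using the `webrtcvad` package (an assumption; the examples file may use a different VAD):

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30  # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)

def speech_frames(pcm: bytes):
    """Yield only the fixed-size frames that contain speech."""
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame
```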
## Voice Agent Applications

### Voice-Controlled Smart Home
```python
class SmartHomeVoiceAgent:
    def __init__(self):
        self.voice_assistant = VoiceAssistant()
        self.devices = {
            "lights": SmartLights(),
            "temperature": SmartThermostat(),
            "security": SecuritySystem(),
        }

    async def handle_voice_command(self, audio_input):
        # Get text from voice
        command_text = await self.voice_assistant.process_voice_input(audio_input)

        # Parse intent
        intent = parse_smart_home_intent(command_text)

        # Execute command
        if intent.action == "turn_on_lights":
            self.devices["lights"].turn_on(intent.room)
        elif intent.action == "set_temperature":
            self.devices["temperature"].set(intent.value)

        # Confirm with voice
        response = f"I've {intent.action_description}"
        audio_output = await self.voice_assistant.synthesize_response(response)
        return audio_output
```
### Voice Meeting Transcription
```python
from datetime import datetime

class VoiceMeetingRecorder:
    def __init__(self):
        self.processor = RealTimeVoiceProcessor()
        self.transcripts = []

    async def record_and_transcribe_meeting(self, duration_seconds=3600):
        audio_stream = self.processor.stream_audio_input()
        buffer = []
        chunk_duration = 30   # Transcribe every 30 seconds
        sample_rate = 16000   # Must match the processor's configured rate
        total_samples = 0

        for audio_chunk in audio_stream:
            buffer.append(audio_chunk)
            total_samples += len(audio_chunk)

            # Flush the buffer once it holds ~30 seconds of audio
            if sum(len(chunk) for chunk in buffer) >= chunk_duration * sample_rate:
                transcript = transcribe_audio_whisper(buffer)
                self.transcripts.append({
                    "timestamp": datetime.now(),
                    "text": transcript,
                })
                buffer = []

            # Stop once the requested meeting length has been captured
            if total_samples >= duration_seconds * sample_rate:
                break

        return self.transcripts
```
## Best Practices

### Audio Quality
- ✓ Use 16kHz sample rate for speech recognition
- ✓ Handle background noise filtering
- ✓ Implement voice activity detection (VAD)
- ✓ Normalize audio levels
- ✓ Use appropriate audio format (WAV for quality)
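For instance, peak normalization takes a few lines with NumPy (a sketch assuming float32 samples in [-1, 1]):

```python
import numpy as np

def normalize_peak(samples: np.ndarray, target: float = 0.9) -> np.ndarray:
    """Scale samples so the loudest peak sits at `target` of full scale."""
    peak = np.abs(samples).max()
    if peak == 0:
        return samples  # silence: nothing to scale
    return samples * (target / peak)
```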
### Latency Optimization
- ✓ Use low-latency STT models
- ✓ Implement streaming transcription
- ✓ Cache common responses
- ✓ Use async processing
- ✓ Minimize network round trips
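Caching synthesized audio for repeated phrases is a one-class job; a minimal sketch where `synthesize` is a placeholder for your TTS client:

```python
import hashlib

class TTSCache:
    """Reuse synthesized audio for frequently spoken phrases."""

    def __init__(self):
        self._store: dict[str, bytes] = {}

    def get_or_synthesize(self, text: str, synthesize) -> bytes:
        key = hashlib.sha256(text.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = synthesize(text)  # hit the TTS API only on a miss
        return self._store[key]
```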
### Error Handling
- ✓ Handle network failures gracefully
- ✓ Implement fallback voices/providers
- ✓ Log audio processing failures
- ✓ Validate audio quality before processing
- ✓ Implement retry logic
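A retry-with-fallback wrapper might look like the sketch below, where `providers` is a hypothetical ordered list of callables, each wrapping one STT service:

```python
import time

def transcribe_with_fallback(audio: bytes, providers, retries: int = 2) -> str:
    """Try each provider in order, retrying transient failures with backoff."""
    last_error = None
    for provider in providers:
        for attempt in range(retries):
            try:
                return provider(audio)
            except Exception as exc:  # narrow to network/provider errors in practice
                last_error = exc
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
    raise RuntimeError("All speech providers failed") from last_error
```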
### Privacy & Security
- ✓ Encrypt audio in transit
- ✓ Delete audio after processing
- ✓ Implement user consent mechanisms
- ✓ Log access to audio data
- ✓ Comply with data regulations (GDPR, CCPA)
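One way to guarantee deletion is a `try/finally` around a temporary file (a sketch; `transcribe` is a placeholder for your STT call):

```python
import os
import tempfile

def transcribe_and_discard(audio_bytes: bytes, transcribe) -> str:
    """Ensure raw audio is removed even if transcription fails."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        return transcribe(path)
    finally:
        os.remove(path)  # the audio never outlives the request
```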
## Common Challenges & Solutions

### Challenge: Accents and Dialects
Solutions:
- Use multilingual models
- Fine-tune on regional data
- Implement language detection
- Use domain-specific vocabularies
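Language detection is built into the open-source Whisper package; this follows the openai-whisper README:

```python
import whisper

model = whisper.load_model("base")

# Whisper detects language on a 30-second mel spectrogram window
audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)

_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```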
### Challenge: Background Noise
Solutions:
- Implement noise filtering
- Use beamforming techniques
- Pre-process audio with noise removal
- Deploy microphone arrays
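As one example of pre-processing, the `noisereduce` package applies spectral gating in a couple of lines (assumes a mono WAV input):

```python
import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("noisy_input.wav")
cleaned = nr.reduce_noise(y=data, sr=rate)  # spectral-gating noise reduction
wavfile.write("clean_input.wav", rate, cleaned)
```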
### Challenge: Long Audio Files
Solutions:
- Implement chunked processing
- Use streaming APIs
- Split into speaker turns
- Implement caching
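A chunked-processing sketch using `pydub`, where `transcribe` is a placeholder for your STT call (production code should also overlap chunks so words are not cut mid-syllable):

```python
from pydub import AudioSegment

def transcribe_long_file(path: str, transcribe, chunk_minutes: int = 5) -> str:
    """Split a long recording into chunks and stitch the transcripts together."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub slices by milliseconds
    parts = []
    for start in range(0, len(audio), chunk_ms):
        chunk = audio[start:start + chunk_ms]
        chunk.export("chunk.wav", format="wav")
        parts.append(transcribe("chunk.wav"))
    return " ".join(parts)
```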
## Frameworks & Libraries

### Speech Recognition
- OpenAI Whisper
- Google Cloud Speech-to-Text
- Azure Speech Services
- AssemblyAI
- DeepSpeech
### Text-to-Speech
- Google Cloud Text-to-Speech
- OpenAI TTS
- Azure Text-to-Speech
- ElevenLabs
- Tacotron 2
## Getting Started
1. Choose STT and TTS providers
2. Set up authentication
3. Build a basic voice pipeline
4. Add conversation management
5. Implement error handling
6. Test with real users
7. Monitor and optimize latency
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.