# qodex-ai / voice-ai-integration

```bash
# Install this skill:
npx skills add qodex-ai/ai-agent-skills --skill "voice-ai-integration"
```

Installs a specific skill from a multi-skill repository.

# Description

Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.

# SKILL.md


---
name: voice-ai-integration
description: Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.
---


# Voice AI Integration

Build intelligent voice-enabled AI applications that understand spoken language and respond naturally through audio, creating seamless voice-first user experiences.

## Overview

Voice AI systems combine three key capabilities:
1. Speech Recognition - Convert audio input to text
2. Natural Language Processing - Understand intent and context
3. Text-to-Speech - Generate natural-sounding responses

## Speech Recognition Providers

See examples/speech_recognition_providers.py for implementations; a minimal call is sketched after the list:
- Google Cloud Speech-to-Text: High accuracy with automatic punctuation
- OpenAI Whisper: Robust multilingual speech recognition
- Azure Speech Services: Enterprise-grade speech recognition
- AssemblyAI: Async processing with high accuracy
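
As a minimal illustration of the Whisper option, here is a transcription call against OpenAI's hosted Whisper API. The file name is a placeholder, and the client reads `OPENAI_API_KEY` from the environment:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Transcribe a local audio file with the hosted Whisper model
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```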

## Text-to-Speech Providers

See examples/text_to_speech_providers.py for implementations; a minimal call is sketched after the list:
- Google Cloud TTS: Natural voices with multiple language support
- OpenAI TTS: Simple integration with high-quality output
- Azure Speech Services: Enterprise TTS with neural voices
- Eleven Labs: Premium voices with emotional control
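
A similarly minimal TTS sketch using OpenAI's hosted voices. The model and voice names are placeholder choices; swap in your provider of choice:

```python
from openai import OpenAI

client = OpenAI()

# Synthesize a short reply and save it as MP3
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! How can I help you today?",
)

with open("reply.mp3", "wb") as f:
    f.write(speech.read())
```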

## Voice Assistant Architecture

See examples/voice_assistant.py for the VoiceAssistant class; a stripped-down sketch of the same flow follows the list:
- Complete voice pipeline: STT → NLP → TTS
- Conversation history management
- Multi-provider support (OpenAI, Google, Azure, etc.)
- Async processing for responsive interactions
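
The full implementation lives in examples/voice_assistant.py. As a stripped-down sketch of one STT → NLP → TTS turn, with conversation history reduced to a single list (model and voice names are placeholder assumptions, not the skill's defaults):

```python
from openai import OpenAI

client = OpenAI()
history = []  # running conversation context across turns

def voice_turn(audio_path: str, out_path: str = "reply.mp3") -> str:
    # 1. Speech-to-text
    with open(audio_path, "rb") as f:
        heard = client.audio.transcriptions.create(model="whisper-1", file=f).text

    # 2. NLP: generate a reply using the conversation history
    history.append({"role": "user", "content": heard})
    chat = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = chat.choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3. Text-to-speech
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    with open(out_path, "wb") as f:
        f.write(speech.read())
    return reply
```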

## Real-Time Voice Processing

See examples/realtime_voice_processor.py for the RealTimeVoiceProcessor class; an illustrative VAD helper follows the list:
- Stream audio input from microphone
- Stream audio output to speakers
- Voice Activity Detection (VAD)
- Configurable sample rates and chunk sizes
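
The processor's internals are in the example file. As an illustration of the VAD step only, a simple energy-threshold detector over 16-bit mono PCM chunks (the threshold is arbitrary and should be tuned per microphone):

```python
import numpy as np

def is_speech(chunk: bytes, threshold: float = 500.0) -> bool:
    """Crude VAD: flag a 16-bit mono PCM chunk whose RMS energy exceeds a threshold."""
    samples = np.frombuffer(chunk, dtype=np.int16).astype(np.float32)
    if samples.size == 0:
        return False
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return rms > threshold
```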

## Voice Agent Applications

### Voice-Controlled Smart Home

```python
class SmartHomeVoiceAgent:
    """Routes voice commands to smart-home devices and confirms by voice.

    VoiceAssistant comes from examples/voice_assistant.py; the device classes
    and parse_smart_home_intent are illustrative placeholders.
    """

    def __init__(self):
        self.voice_assistant = VoiceAssistant()
        self.devices = {
            "lights": SmartLights(),
            "temperature": SmartThermostat(),
            "security": SecuritySystem(),
        }

    async def handle_voice_command(self, audio_input):
        # Get text from voice
        command_text = await self.voice_assistant.process_voice_input(audio_input)

        # Parse intent
        intent = parse_smart_home_intent(command_text)

        # Execute command
        if intent.action == "turn_on_lights":
            self.devices["lights"].turn_on(intent.room)
        elif intent.action == "set_temperature":
            self.devices["temperature"].set(intent.value)

        # Confirm with voice
        response = f"I've {intent.action_description}"
        audio_output = await self.voice_assistant.synthesize_response(response)

        return audio_output
```

### Voice Meeting Transcription

```python
import time
from datetime import datetime


class VoiceMeetingRecorder:
    """Buffers live audio and transcribes it in fixed-length chunks.

    RealTimeVoiceProcessor comes from examples/realtime_voice_processor.py;
    transcribe_audio_whisper is the STT helper from the provider examples.
    """

    def __init__(self):
        self.processor = RealTimeVoiceProcessor()
        self.transcripts = []

    async def record_and_transcribe_meeting(self, duration_seconds=3600):
        audio_stream = self.processor.stream_audio_input()
        started = time.monotonic()

        buffer = []
        chunk_duration = 30  # transcribe every 30 seconds of audio

        for audio_chunk in audio_stream:
            buffer.append(audio_chunk)

            # Assuming 16 kHz, 16-bit mono PCM: 32,000 bytes per second
            if sum(len(chunk) for chunk in buffer) >= chunk_duration * 16000 * 2:
                transcript = transcribe_audio_whisper(b"".join(buffer))
                self.transcripts.append({
                    "timestamp": datetime.now(),
                    "text": transcript,
                })
                buffer = []

            # Stop once the requested meeting duration has elapsed
            if time.monotonic() - started >= duration_seconds:
                break

        return self.transcripts
```

## Best Practices

### Audio Quality

- ✓ Use a 16 kHz sample rate for speech recognition
- ✓ Handle background noise filtering
- ✓ Implement voice activity detection (VAD)
- ✓ Normalize audio levels (see the sketch below)
- ✓ Use an appropriate audio format (WAV for quality)
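
One way to normalize levels before sending audio to an STT provider, assuming float samples scaled to [-1.0, 1.0]:

```python
import numpy as np

def peak_normalize(samples: np.ndarray, target_peak: float = 0.9) -> np.ndarray:
    """Scale float audio in [-1.0, 1.0] so its loudest sample sits at target_peak."""
    peak = float(np.max(np.abs(samples)))
    if peak == 0.0:
        return samples  # silence: nothing to scale
    return samples * (target_peak / peak)
```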

### Latency Optimization

- ✓ Use low-latency STT models
- ✓ Implement streaming transcription
- ✓ Cache common responses (see the sketch below)
- ✓ Use async processing
- ✓ Minimize network round trips
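
Caching synthesized audio for phrases the assistant repeats (greetings, confirmations) removes a whole TTS round trip. A minimal in-memory sketch, reusing the OpenAI TTS call shown earlier:

```python
from openai import OpenAI

client = OpenAI()
_tts_cache: dict[str, bytes] = {}

def speak(text: str) -> bytes:
    """Return synthesized audio, reusing cached bytes for repeated phrases."""
    key = " ".join(text.lower().split())  # normalize case and whitespace
    if key not in _tts_cache:
        speech = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
        _tts_cache[key] = speech.read()
    return _tts_cache[key]
```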

### Error Handling

- ✓ Handle network failures gracefully
- ✓ Implement fallback voices/providers
- ✓ Log audio processing failures
- ✓ Validate audio quality before processing
- ✓ Implement retry logic (see the sketch below)
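
Retry and provider fallback combine naturally in one helper. A sketch, assuming each provider is wrapped as a callable from audio bytes to text:

```python
import time

def transcribe_with_fallback(audio: bytes, providers: list, retries: int = 2) -> str:
    """Try each STT provider in order, retrying transient failures with backoff."""
    last_error = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(audio)
            except Exception as exc:  # narrow to provider-specific errors in practice
                last_error = exc
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError("All STT providers failed") from last_error
```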

### Privacy & Security

- ✓ Encrypt audio in transit
- ✓ Delete audio after processing
- ✓ Implement user consent mechanisms
- ✓ Log access to audio data
- ✓ Comply with data regulations (GDPR, CCPA)

## Common Challenges & Solutions

### Challenge: Accents and Dialects

Solutions:

- Use multilingual models
- Fine-tune on regional data
- Implement language detection (see the sketch below)
- Use domain-specific vocabularies
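
Language detection comes almost free with Whisper: requesting verbose JSON output returns the detected language alongside the transcript (file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

with open("clip.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
    )

print(result.language, "->", result.text)
```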

### Challenge: Background Noise

Solutions:

- Implement noise filtering (see the high-pass sketch below)
- Use beamforming techniques
- Pre-process audio with noise removal
- Deploy microphone arrays
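
A basic pre-processing step is a high-pass filter to remove low-frequency rumble (fans, traffic). A sketch with SciPy, assuming float samples at 16 kHz; the cutoff is a tunable assumption:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def high_pass(samples: np.ndarray, sample_rate: int = 16000,
              cutoff_hz: float = 100.0) -> np.ndarray:
    """Attenuate content below cutoff_hz (crude rumble/noise reduction)."""
    sos = butter(4, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfilt(sos, samples)
```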

### Challenge: Long Audio Files

Solutions:

- Implement chunked processing (see the sketch below)
- Use streaming APIs
- Split into speaker turns
- Implement caching
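
Chunked processing of a long recording needs nothing beyond the standard-library wave module; each yielded chunk can then be transcribed independently:

```python
import wave

def split_wav(path: str, chunk_seconds: int = 30):
    """Yield fixed-length raw PCM chunks from a WAV file."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * chunk_seconds
        while True:
            frames = wav.readframes(frames_per_chunk)
            if not frames:
                break
            yield frames
```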

## Frameworks & Libraries

### Speech Recognition

- OpenAI Whisper
- Google Cloud Speech-to-Text
- Azure Speech Services
- AssemblyAI
- DeepSpeech

### Text-to-Speech

- Google Cloud Text-to-Speech
- OpenAI TTS
- Azure Text-to-Speech
- Eleven Labs
- Tacotron 2

## Getting Started

1. Choose STT and TTS providers
2. Set up authentication
3. Build a basic voice pipeline
4. Add conversation management
5. Implement error handling
6. Test with real users
7. Monitor and optimize latency

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.