```bash
npx skills add kesslerio/phone-agent-moltbot-skill
```

Or install the skill directly from its repository:

```bash
npx add-skill https://github.com/kesslerio/phone-agent-moltbot-skill
```
# Description
Run a real-time AI phone agent using Twilio, Deepgram, and ElevenLabs. Handles incoming calls, transcribes audio, generates responses via LLM, and speaks back via streaming TTS. Use when user wants to: (1) Test voice AI capabilities, (2) Handle phone calls programmatically, (3) Build a conversational voice bot.
# SKILL.md

```yaml
name: phone-agent
description: "Run a real-time AI phone agent using Twilio, Deepgram, and ElevenLabs. Handles incoming calls, transcribes audio, generates responses via LLM, and speaks back via streaming TTS. Use when user wants to: (1) Test voice AI capabilities, (2) Handle phone calls programmatically, (3) Build a conversational voice bot."
```
## Phone Agent Skill

Runs a local FastAPI server that acts as a real-time voice bridge.

### Architecture

```
Twilio (Phone) <--> WebSocket (Audio) <--> [Local Server] <--> Deepgram (STT)
                                                |
                                                +--> OpenAI (LLM)
                                                +--> ElevenLabs (TTS)
```
### Prerequisites

- Twilio Account: Phone number + TwiML App.
- Deepgram API Key: For fast speech-to-text.
- OpenAI API Key: For the conversation logic.
- ElevenLabs API Key: For realistic text-to-speech.
- Ngrok (or similar): To expose your local port 8080 to Twilio.
### Setup

1. Install dependencies:

   ```bash
   pip install -r scripts/requirements.txt
   ```

2. Set environment variables (in `~/.moltbot/.env`, `~/.clawdbot/.env`, or via `export`):

   ```bash
   export DEEPGRAM_API_KEY="your_key"
   export OPENAI_API_KEY="your_key"
   export ELEVENLABS_API_KEY="your_key"
   export TWILIO_ACCOUNT_SID="your_sid"
   export TWILIO_AUTH_TOKEN="your_token"
   export PORT=8080
   ```

   Optional system prompt customization (priority: file > env var > built-in):

   ```bash
   # Option 1: Load from file
   export SYSTEM_PROMPT_FILE="/path/to/custom-prompt.txt"

   # Option 2: Set directly via env var
   export SYSTEM_PROMPT="You are a helpful phone assistant. Be concise and friendly."

   # Option 3: Use built-in defaults with name customization
   export AGENT_NAME="Niemand"
   export OWNER_NAME="Martin's"
   ```

3. Start the server:

   ```bash
   python3 scripts/server.py
   ```

4. Expose it to the internet:

   ```bash
   ngrok http 8080
   ```

5. Configure Twilio:
   - Go to your Phone Number settings.
   - Set "Voice & Fax" -> "A Call Comes In" to Webhook.
   - URL: `https://<your-ngrok-url>.ngrok.io/incoming`
   - Method: `POST`
### Usage

Call your Twilio number. The agent should answer, transcribe your speech, think, and reply in a natural voice.

### Customization

- System Prompt: Configure via `SYSTEM_PROMPT_FILE` (load from file), `SYSTEM_PROMPT` (env var), or modify the built-in defaults with `AGENT_NAME` and `OWNER_NAME`.
- Voice: Change `ELEVENLABS_VOICE_ID` to use different voices.
- Model: Switch `gpt-4o-mini` to `gpt-4` for smarter (but slower) responses.
- Language: Set `AGENT_LANGUAGE` to `en` or `de` for English or German.
# README.md
## Phone Agent Moltbot Skill

A real-time AI voice agent that handles incoming phone calls using Twilio, transcribes speech with Deepgram, generates responses via OpenAI, and speaks back with ElevenLabs text-to-speech.

### Features

- Real-time Voice Processing: Handles incoming Twilio calls with low-latency WebSocket audio
- Automatic Speech Recognition: Deepgram for fast, accurate transcription
- AI-Powered Responses: OpenAI GPT for intelligent conversation
- Natural Speech Output: ElevenLabs for realistic, streaming TTS
- Task-Based Automation: Configurable task definitions for specific agent behaviors
- Recording & Logging: Automatic call recording and conversation logs
### Architecture

```
Incoming Call (Twilio Phone)
        |
        v
Twilio WebSocket (Audio Stream)
        |
        +---> Local FastAPI Server
        |         |
        |         +---> Deepgram (Speech-to-Text)
        |         |
        |         +---> OpenAI (LLM/Intelligence)
        |         |
        |         +---> ElevenLabs (Text-to-Speech)
        |         |
        +---------- (Audio Response)
        |
        v
Phone Speaker Output
```
### Prerequisites

Before you begin, ensure you have:

- Twilio Account
  - Active Twilio account with a phone number
  - TwiML App configured
  - Account SID and Auth Token
- API Keys (free tier available for all)
  - Deepgram API Key (https://console.deepgram.com/)
  - OpenAI API Key (https://platform.openai.com/api-keys)
  - ElevenLabs API Key (https://elevenlabs.io/)
- Local Network Access
  - Ngrok or similar tool to expose localhost to the internet
  - Ability to accept incoming webhooks from Twilio
- Python 3.9+ and pip
### Installation

```bash
# Clone the repository
git clone https://github.com/kesslerio/phone-agent-moltbot-skill.git
cd phone-agent-moltbot-skill

# Install dependencies
pip install -r scripts/requirements.txt
```
### Configuration

#### Set Environment Variables

Create a `.env` file or set environment variables:

```bash
# API Keys (required)
export DEEPGRAM_API_KEY="your-deepgram-key"
export OPENAI_API_KEY="your-openai-key"
export ELEVENLABS_API_KEY="your-elevenlabs-key"

# Twilio (required)
export TWILIO_ACCOUNT_SID="your-account-sid"
export TWILIO_AUTH_TOKEN="your-auth-token"
export TWILIO_PHONE_NUMBER="+18665515246"  # Your Twilio number

# Server (optional)
export PORT=8080
export PUBLIC_URL="https://your-ngrok-url.ngrok.io"  # For webhooks

# Voice Customization (optional)
export ELEVENLABS_VOICE_ID="onwK4e9ZLuTAKqWW03F9"  # Daniel voice

# System Prompt Configuration (optional)
export SYSTEM_PROMPT_FILE="/path/to/custom-prompt.txt"  # Load prompt from file
export SYSTEM_PROMPT_FILE_REQUIRED="true"  # Exit if file missing (default: false)
export SYSTEM_PROMPT="Custom prompt text here"  # Override built-in prompt
```
Template Variables: When using `SYSTEM_PROMPT_FILE`, you can include these placeholders:

- `{agent_name}` - Replaced with `AGENT_NAME` env var (default: "Assistant")
- `{owner_name}` - Replaced with `OWNER_NAME` env var (default: "your")
- `{language}` - Replaced with `AGENT_LANGUAGE` env var (default: "en")

Example custom prompt file:

```
You are {agent_name}, {owner_name} personal assistant.
Speak in {language} with precision and clarity.
```
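The precedence rules above (file > env var > built-in, then placeholder substitution) can be sketched in Python. This illustrates the documented behavior only; `resolve_system_prompt` is a hypothetical helper name, not the actual `server.py` implementation:

```python
import os

# Built-in fallback used when neither SYSTEM_PROMPT_FILE nor SYSTEM_PROMPT is set.
DEFAULT_PROMPT = "You are {agent_name}, {owner_name} personal assistant. Speak in {language}."

def resolve_system_prompt(env=os.environ) -> str:
    """Apply the documented priority: file > env var > built-in default,
    then fill the {agent_name}/{owner_name}/{language} placeholders."""
    template = env.get("SYSTEM_PROMPT") or DEFAULT_PROMPT
    path = env.get("SYSTEM_PROMPT_FILE")
    if path:
        try:
            with open(path) as f:
                template = f.read()
        except OSError:
            # Documented: only fatal when SYSTEM_PROMPT_FILE_REQUIRED=true
            if env.get("SYSTEM_PROMPT_FILE_REQUIRED", "false").lower() == "true":
                raise
    return template.format(
        agent_name=env.get("AGENT_NAME", "Assistant"),
        owner_name=env.get("OWNER_NAME", "your"),
        language=env.get("AGENT_LANGUAGE", "en"),
    )

print(resolve_system_prompt({"SYSTEM_PROMPT": "Hi, I am {agent_name}."}))  # Hi, I am Assistant.
```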
Or add to `~/.moltbot/.env` or `~/.clawdbot/.env`:

```
DEEPGRAM_API_KEY=your-key
OPENAI_API_KEY=your-key
ELEVENLABS_API_KEY=your-key
TWILIO_ACCOUNT_SID=your-sid
TWILIO_AUTH_TOKEN=your-token
TWILIO_PHONE_NUMBER=+1...
```
### Startup & Configuration

#### 1. Start the Local Server

```bash
python3 scripts/server.py
```

The server will start on http://localhost:8080 by default.

#### 2. Expose to the Internet with Ngrok

In another terminal:

```bash
ngrok http 8080
```

Note the HTTPS URL (e.g., https://abc123.ngrok.io).
#### 3. Configure the Twilio Webhook

In the Twilio Console:

- Go to Phone Numbers → Your number
- Under Voice & Fax:
  - Set "A Call Comes In" to Webhook
  - URL: `https://<your-ngrok-url>.ngrok.io/incoming`
  - Method: `POST`
- Save
#### 4. Test Incoming Calls

Call your Twilio number. The agent will:

1. Answer and greet you
2. Listen to your speech
3. Transcribe your words
4. Generate a response via OpenAI
5. Speak the response back to you
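Steps 2–5 form one conversational turn. A testable sketch of that loop, with the OpenAI and ElevenLabs calls injected as plain callables (the names here are illustrative, not the actual `server.py` API):

```python
from typing import Callable

def run_turn(
    transcript: str,
    history: list[dict],
    llm: Callable[[list[dict]], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One conversational turn: append the caller's transcript,
    ask the LLM for a reply, synthesize audio, and keep the
    chat history in sync. `llm` and `tts` are injected so the
    flow can be exercised offline."""
    history.append({"role": "user", "content": transcript})
    reply = llm(history)
    history.append({"role": "assistant", "content": reply})
    return tts(reply)

# Offline demo with stand-ins for OpenAI and ElevenLabs:
history = [{"role": "system", "content": "You are a phone assistant."}]
audio = run_turn(
    "What are your hours?",
    history,
    llm=lambda msgs: "We are open 9 to 5.",
    tts=lambda text: text.encode(),  # real code would stream ElevenLabs audio
)
print(audio)  # b'We are open 9 to 5.'
```

In the real server, `llm` would wrap `client.chat.completions.create(...)` and `tts` a streaming ElevenLabs request, but the turn-keeping logic stays the same.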
### Customization

#### Change Agent Persona

Edit `SYSTEM_PROMPT` in `scripts/server.py`:

```python
SYSTEM_PROMPT = """You are a helpful customer service agent. Be friendly, concise, and professional."""
```

#### Change Voice

Set a different ElevenLabs voice ID:

```bash
export ELEVENLABS_VOICE_ID="g1r0eKKcGkk7Ep0RVcVn"  # Callum voice
```

Available ElevenLabs voices: https://elevenlabs.io/docs/getting-started/voices
#### Use a Different Model

Edit `scripts/server.py` and change the OpenAI model:

```python
response = await client.chat.completions.create(
    model="gpt-4",  # or "gpt-4-turbo" for faster responses
    messages=messages,
)
```
#### Task-Based Behaviors

Create YAML task definitions in the `tasks/` directory:

```yaml
name: book_restaurant
description: "Help the user book a restaurant reservation"
system_prompt: "You are a friendly restaurant reservation assistant..."
actions:
  - confirm_date
  - confirm_time
  - confirm_party_size
  - book_reservation
```
#### Integration with Moltbot

Add this skill to your Moltbot configuration:

```json
{
  "skills": [
    {
      "name": "phone-agent",
      "path": "/path/to/phone-agent-moltbot-skill",
      "enabled": true
    }
  ]
}
```

Then reference it in workflows:

- "Set up an incoming voice agent"
- "Configure a customer service chatbot"
- "Test voice AI capabilities"
### Project Structure

```
phone-agent-moltbot-skill/
├── scripts/
│   ├── server.py             # Main FastAPI server
│   ├── server_realtime.py    # Realtime processing variant
│   ├── requirements.txt      # Python dependencies
│   └── typing_sound.raw      # Typing sound effect
├── tasks/
│   ├── book_restaurant.yaml  # Example task definition
│   └── get_quote.yaml        # Example task definition
├── calls/                    # Recording storage directory
├── references/               # Supporting documentation
├── SKILL.md                  # Moltbot skill manifest
├── README.md                 # This file
└── LICENSE                   # MIT License
```
### Troubleshooting

#### Server Won't Start

- Check the Python version: `python3 --version` (requires 3.9+)
- Install dependencies: `pip install -r scripts/requirements.txt`
- Check the PORT variable: `echo $PORT` (should be 8080 or your set value)

#### Twilio Webhook Not Connecting

- Verify ngrok is running and the URL matches your Twilio webhook
- Check the server logs: `python3 scripts/server.py` (should show incoming requests)
- Test the ngrok tunnel: `curl https://<your-ngrok-url>.ngrok.io/health`
#### Poor Transcription Quality

- Ensure `DEEPGRAM_API_KEY` is valid
- Check microphone/audio quality on the calling phone
- Deepgram is generally very accurate; persistently poor results usually indicate an audio issue
#### Slow Responses

- OpenAI API latency varies; `gpt-4o-mini` is fast and cheap
- Switch to `gpt-3.5-turbo` for faster responses (less capable)
- Increase the timeout in the WebSocket settings if needed

#### Voice Not Speaking

- Verify `ELEVENLABS_API_KEY` is valid
- Check that the voice ID is correct: https://elevenlabs.io/docs/api-reference/voices
- Confirm audio is not muted on the receiving phone
### API Reference

#### Incoming Call Webhook

```
POST /incoming
```

Twilio sends call information to this endpoint. The server responds with TwiML to establish the WebSocket connection.
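The TwiML itself isn't reproduced above. A minimal response that points Twilio's Media Stream at the documented `/ws` route might look like the following; this is a sketch using Twilio's `<Connect><Stream>` verb, not code from `server.py`:

```python
from xml.sax.saxutils import quoteattr

def incoming_call_twiml(ws_url: str) -> str:
    """Build the TwiML reply for POST /incoming: tells Twilio to open
    a bidirectional Media Stream to our WebSocket endpoint."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        "<Stream url=" + quoteattr(ws_url) + " />"
        "</Connect></Response>"
    )

print(incoming_call_twiml("wss://abc123.ngrok.io/ws"))
```

In practice the `wss://` host would come from `PUBLIC_URL` so the TwiML always points at the current ngrok tunnel.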
#### WebSocket Audio Stream

```
WS /ws
```

Bidirectional audio stream for incoming call processing.

#### Health Check

```
GET /health
```

Returns `{"status": "ok"}` if the server is running.
### Performance & Scaling

The current implementation handles:

- A single concurrent call per server instance
- ~100ms RTT for transcription + LLM + TTS
- Suitable for demos/testing, hobby projects, and low-volume use

For production:

- Run multiple server instances behind a load balancer
- Use Twilio's call queuing
- Implement connection pooling for API clients
- Consider dedicated hardware for Deepgram/ElevenLabs processing
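The single-concurrent-call limit mentioned above can be enforced with a small guard. A sketch (illustrative, not the shipped implementation) that rejects a second caller while one call is live:

```python
class SingleCallGuard:
    """Tracks whether a call is in progress so a second concurrent
    caller can be rejected (e.g. with a busy TwiML response)."""

    def __init__(self) -> None:
        self._busy = False

    def try_acquire(self) -> bool:
        """Claim the line; returns False if a call is already active."""
        if self._busy:
            return False
        self._busy = True
        return True

    def release(self) -> None:
        """Free the line when the call ends (hangup or error)."""
        self._busy = False

guard = SingleCallGuard()
print(guard.try_acquire())  # True  -- first caller gets the line
print(guard.try_acquire())  # False -- second caller is rejected
guard.release()
print(guard.try_acquire())  # True  -- the line is free again
```

A load-balanced deployment would replace this per-process flag with shared state (or Twilio's own call queuing) so capacity scales with the number of instances.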
### Deployment Options

#### Local Development

```bash
python3 scripts/server.py
ngrok http 8080
```

#### Docker

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY scripts/requirements.txt .
RUN pip install -r requirements.txt
COPY scripts/ .
CMD ["python3", "server.py"]
```

Build and run:

```bash
docker build -t phone-agent .
docker run -p 8080:8080 \
  -e DEEPGRAM_API_KEY="..." \
  -e OPENAI_API_KEY="..." \
  -e ELEVENLABS_API_KEY="..." \
  -e TWILIO_ACCOUNT_SID="..." \
  -e TWILIO_AUTH_TOKEN="..." \
  phone-agent
```
#### Cloud Deployment

- Heroku: Add a `Procfile` → `web: python3 scripts/server.py`
- Railway.app: Auto-detects Python and builds
- AWS Lambda: Use WebSocket API Gateway + Lambda
- Google Cloud Run: Containerize and deploy
### License

MIT

### Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch
3. Test thoroughly
4. Submit a pull request

### Support

- MCP Server: Deepgram | OpenAI | ElevenLabs
- Twilio Docs: Voice API
- Moltbot: Documentation
Requirements
ffmpegmust be in PATH (for converting ElevenLabs MP3 to Twilio mu-law audio)- If running as a systemd service, ensure PATH includes ffmpeg location:
ini Environment=PATH=/home/art/.nix-profile/bin:/usr/bin:/bin
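For reference, this requirement implies an ffmpeg invocation of roughly this shape: converting an ElevenLabs MP3 into the 8 kHz mono mu-law stream Twilio Media Streams expects. The helper name is illustrative, and the exact flags in `server.py` may differ:

```python
def mp3_to_mulaw_cmd() -> list[str]:
    """ffmpeg argument list: MP3 in on stdin, raw 8 kHz mono mu-law out on stdout."""
    return [
        "ffmpeg",
        "-i", "pipe:0",  # read the MP3 from stdin
        "-ar", "8000",   # resample to 8 kHz (what Twilio expects)
        "-ac", "1",      # downmix to mono
        "-f", "mulaw",   # raw mu-law output format
        "pipe:1",        # write to stdout
    ]

# Actual use (requires ffmpeg on PATH):
#   result = subprocess.run(mp3_to_mulaw_cmd(), input=mp3_bytes,
#                           capture_output=True, check=True)
#   mulaw_audio = result.stdout
```

If ffmpeg is missing from PATH (the systemd case above), `subprocess.run` raises `FileNotFoundError`, which is the symptom to check for first.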
# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.