Use when handling voice messages in chat: transcribe voice input and reply in the same format (voice-to-voice, text-to-text), with keyword overrides for voice-only or text-only output.
npx skills add Johnny-xuan/smart-voice-chat
Or install specific skill: npx add-skill https://github.com/Johnny-xuan/smart-voice-chat
# Description
Voice conversation: transcribe voice input, reply in same format (voice-to-voice, text-to-text) by default. User can override with voice or text commands.
# SKILL.md
name: smart-voice-chat
description: "Voice conversation: transcribe voice input, reply in same format (voice-to-voice, text-to-text) by default. User can override with voice or text commands."
metadata: {"clawdbot":{"emoji":"🗣️","os":["darwin","linux"],"requires":{"anyBins":["ffmpeg"],"python":["sherpa-onnx","yaml"]}}}
SmartVoice Chat 🗣️
Intelligent voice conversation system with automatic format detection and mirrored response mode.
🎯 How It Works
Core Principle: Reply in the same format as the input (voice → voice, text → text)
Voice Input → STT → AI Process → TTS → Voice Reply
Text Input → AI Process → Text Reply
📋 Workflow for AI Agents
Step 1: Detect Input Format
When receiving a message from the user:
Check if message contains:
- Voice attachment (audio/* mime type)
- Audio file path (.wav, .ogg, .mp3, .m4a)
If voice detected:
~/.clawdbot/skills/smart-voice-chat/bin/stt.py <audio_file>
If text:
- Use the text directly
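A minimal sketch of this detection step, assuming the incoming message is a dict with `text` and `attachments` fields (the field names are illustrative, not a Clawdbot API) and that `stt.py` prints the transcript to stdout:

```python
import subprocess
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".opus", ".flac"}
STT_SCRIPT = Path("~/.clawdbot/skills/smart-voice-chat/bin/stt.py").expanduser()

def extract_user_text(message):
    """Return (text, is_voice_input) for an incoming message dict."""
    for attachment in message.get("attachments", []):
        mime = attachment.get("mime_type", "")
        path = attachment.get("path", "")
        if mime.startswith("audio/") or Path(path).suffix.lower() in AUDIO_EXTENSIONS:
            # Voice input: always transcribe before any AI processing.
            result = subprocess.run(
                ["python3", str(STT_SCRIPT), path],
                capture_output=True, text=True, check=True,
            )
            return result.stdout.strip(), True
    # Text input: use the text directly.
    return message.get("text", ""), False
```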
Step 2: Parse Intent
Check the transcribed text for keywords:
- "用语音回答" (reply with voice), "读出来" (read it out), "说一下" (say it) → Voice-only output
- "用文字回答" (reply with text), "不用读" (don't read it), "只显示" (display only) → Text-only output
- Default → Use the same format as the input (voice → voice, text → text)
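A minimal sketch of this intent check (the keyword lists mirror the examples above; real messages may need fuzzier matching):

```python
VOICE_KEYWORDS = ["用语音回答", "读出来", "说一下"]   # "reply with voice", "read it out", "say it"
TEXT_KEYWORDS = ["用文字回答", "不用读", "只显示"]     # "reply with text", "don't read it", "display only"

def parse_intent(text, is_voice_input):
    """Return (output_mode, cleaned_text); output_mode is 'voice' or 'text'."""
    mode = "voice" if is_voice_input else "text"   # default: mirror the input format
    for keyword in VOICE_KEYWORDS:
        if keyword in text:
            mode = "voice"
            text = text.replace(keyword, "")
    for keyword in TEXT_KEYWORDS:
        if keyword in text:
            mode = "text"
            text = text.replace(keyword, "")
    # Strip leftover separators such as "：" or ":" after removing keywords.
    return mode, text.strip(" ：:，,")
```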
Step 3: Process with AI
Use the transcribed/cleaned text as the user's actual message for AI processing.
Step 4: Generate Response
For voice output or dual mode:
~/.clawdbot/skills/smart-voice-chat/bin/tts.py \
"<AI_RESPONSE_TEXT>" \
/tmp/smart-voice-chat/response_<timestamp>
Note: TTS will automatically output .ogg format (Telegram voice message compatible)
Then attach the audio file to the reply:
- For Telegram: Use audioAsVoice: true with mediaUrl (.ogg file)
- For iMessage: Attach the .ogg file (Telegram compatible format)
- For other channels: Attach based on channel capabilities
For text-only mode:
- Send text only, no audio attachment
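A minimal sketch of this response step, assuming `tts.py` takes the response text and an output path prefix as shown above and writes `<prefix>.ogg`:

```python
import subprocess
import time
from pathlib import Path

TTS_SCRIPT = Path("~/.clawdbot/skills/smart-voice-chat/bin/tts.py").expanduser()
OUTPUT_DIR = Path("/tmp/smart-voice-chat")

def build_reply(response_text, output_mode):
    """Build a reply payload; attach audio only for voice output."""
    if output_mode == "text":
        return {"text": response_text}                      # text-only: no audio attachment
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    prefix = OUTPUT_DIR / f"response_{int(time.time())}"    # unique filename per reply
    subprocess.run(["python3", str(TTS_SCRIPT), response_text, str(prefix)], check=True)
    return {                                                # Telegram-style payload (see Reply Format)
        "text": response_text,
        "mediaUrl": f"{prefix}.ogg",
        "audioAsVoice": True,
    }
```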
💡 Usage Examples
Example 1: Voice Input → Voice Reply (Default)
You: [Send voice message] "今天天气怎么样" ("What's the weather like today?")
AI:
1. Detects voice attachment
2. Runs STT → "今天天气怎么样"
3. Processes AI → "今天晴天,气温25度" ("It's sunny today, 25°C")
4. Runs TTS → generates .ogg file
5. Sends: "今天晴天,气温25度" + voice attachment
Example 2: Text Input → Text Reply (Default)
You: "今天天气怎么样"
AI:
1. Detects text input
2. Processes AI → "今天晴天,气温25度"
3. Sends: "今天晴天,气温25度" (text only)
Example 3: Voice Input → Text Reply (Special Request)
You: [Send voice] "用文字回答:今天几点了" ("Reply with text: what time is it?")
AI:
1. Detects voice attachment
2. Runs STT → "用文字回答:今天几点了"
3. Parses intent → Text-only mode
4. Cleans text → "今天几点了"
5. Processes AI → "现在是下午4点" ("It's 4 PM now")
6. Sends: "现在是下午4点" (text only, no voice)
Example 4: Text Input → Voice Reply (Special Request)
You: "用语音回答:明天会下雨吗"
AI:
1. Detects text input
2. Parses intent → Voice-only mode
3. Cleans text → "明天会下雨吗"
4. Processes AI → "明天可能有小雨" ("Light rain is possible tomorrow")
5. Runs TTS → generates .ogg file
6. Sends: voice attachment
🔧 Configuration
Edit ~/.clawdbot/skills/smart-voice-chat/config/config.yaml:
# Default behavior
voice:
  input_mode: auto      # Auto-detect input type
  output_mode: mirror   # mirror = same format as input
  auto_play: false      # Let Clawdbot handle playback

# STT settings
stt:
  model_path: ~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09
  language: zh-en

# TTS settings
tts:
  model_path: ~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en
  output_dir: /tmp/smart-voice-chat
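If your own scripts need these values, a short loading sketch follows (it assumes PyYAML, which the skill lists in its python requirements, and expands the `~` paths itself; how the bundled scripts actually read the file is not documented here):

```python
import os
import yaml  # PyYAML

CONFIG_PATH = os.path.expanduser("~/.clawdbot/skills/smart-voice-chat/config/config.yaml")

with open(CONFIG_PATH) as f:
    config = yaml.safe_load(f)

stt_model = os.path.expanduser(config["stt"]["model_path"])
tts_model = os.path.expanduser(config["tts"]["model_path"])
output_dir = config["tts"]["output_dir"]   # e.g. /tmp/smart-voice-chat
```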
🎨 Reply Format
For Telegram
{
  "text": "AI response text",
  "mediaUrl": "/tmp/smart-voice-chat/response_xxx.ogg",
  "audioAsVoice": true
}
For iMessage
{
  "text": "AI response text",
  "attachments": [
    {
      "original_path": "/tmp/smart-voice-chat/response_xxx.ogg"
    }
  ]
}
⚠️ Important Notes
- Default mode is "mirror": Reply in same format as input
- Always transcribe voice first: Don't process raw audio files
- Clean intent keywords: Remove "用语音回答" etc. before AI processing
- Generate unique filenames: Use timestamp or random ID for TTS output
- Handle STT failures: If transcription fails, ask user to repeat
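A sketch of the STT-failure handling mentioned in the last note, again assuming `stt.py` prints the transcript to stdout and exits non-zero on failure:

```python
import subprocess
from pathlib import Path

STT_SCRIPT = Path("~/.clawdbot/skills/smart-voice-chat/bin/stt.py").expanduser()

def safe_transcribe(audio_path):
    """Return the transcript, or None when the user should be asked to repeat."""
    try:
        result = subprocess.run(
            ["python3", str(STT_SCRIPT), audio_path],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return None                      # STT crashed: ask the user to repeat
    transcript = result.stdout.strip()
    return transcript or None            # empty transcript: ask the user to repeat
```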
📊 Supported Audio Formats
Input: .wav, .mp3, .m4a, .ogg, .opus, .flac
Output: .ogg (OPUS encoded, Telegram voice message compatible)
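The Tech Stack section notes that FFmpeg handles the WAV → OGG/OPUS step behind this output format; a minimal sketch of that conversion (the bitrate is an illustrative choice, not taken from the skill):

```python
import subprocess

def wav_to_ogg(wav_path, ogg_path):
    """Convert a WAV file to OPUS-in-OGG, the Telegram voice message format."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", wav_path,
         "-c:a", "libopus", "-b:a", "32k",   # OPUS codec in an OGG container
         ogg_path],
        check=True,
    )
```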
TL;DR: Auto-detect input format → Process with AI → Reply in same format (unless user requests otherwise)
# README.md
SmartVoice Chat 🗣️
Intelligent Voice Conversation Skill for Clawdbot
Offline voice-to-voice interaction powered by Sherpa-ONNX. Automatically detects voice/text input and replies in the same format.
✨ Features
- 🎯 Auto Detection - Automatically detects voice vs text input
- 🗣️ Voice-to-Voice - Replies in same format (voice→voice, text→text)
- 🌏 Chinese-English - Native mixed language support
- 🔒 Fully Offline - No cloud, privacy-preserving
- 📱 Telegram Ready - Outputs OGG format for voice messages
📋 System Requirements
Operating System
- macOS: 11.0+ (Big Sur or later)
- Linux: Ubuntu 20.04+, Debian 11+, or equivalent
- Architecture: x86_64 or ARM64 (Apple Silicon supported)
Dependencies
| Component | Version | Installation |
|---|---|---|
| Clawdbot | Latest | npm install -g clawdbot@latest |
| Node.js | 22+ | brew install node (macOS) or nvm install 22 |
| Python | 3.8+ | brew install python3 (macOS) or apt install python3 |
| FFmpeg | 4.0+ | brew install ffmpeg (macOS) or apt install ffmpeg |
| pip3 | Latest | Included with Python 3 |
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 4 GB | 8 GB+ |
| Storage | 2 GB free | 4 GB+ free (for models) |
| CPU | Any modern CPU | Apple Silicon M1/M2/M3 or Intel Core i5+ |
Network Requirements
- Required: For downloading models and dependencies (initial setup only)
- Runtime: Fully offline after installation
🔧 Sherpa-ONNX Models
STT (Speech-to-Text) Options
| Model | Language | Size | Speed | Accuracy |
|---|---|---|---|---|
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese | 950MB | Fast | High |
| sherpa-onnx-streaming-zh-en-2024-03-12 | Chinese-English | 490MB | Very Fast | Medium |
Recommended: sherpa-onnx-paraformer-zh-2024-03-09 for best accuracy.
TTS (Text-to-Speech) Options
Chinese-English (Recommended)
| Model | Voice | Size | Download |
|---|---|---|---|
| vits-melo-tts-zh_en | Female | 163MB | Download |
| vits-piper-zh_CN-huayan | Female | 300MB | Download |
English Voices
| Model | Accent | Voice | Size | Download |
|---|---|---|---|---|
| vits-piper-en_US-lessac-high | American | Male | 500MB | Download |
| vits-piper-en_US-glados | American | Female | 300MB | Download |
| vits-piper-en_GB-semaine | British | Female | 300MB | Download |
| vits-piper-en_GB-lessac | British | Male | 500MB | Download |
Japanese
| Model | Voice | Size | Download |
|---|---|---|---|
| vits-vctk | Multi-speaker | 500MB+ | Download |
Spanish
| Model | Voice | Size | Download |
|---|---|---|---|
| vits-piper-es_ES-vox | Female | 300MB | Download |
French
| Model | Voice | Size | Download |
|---|---|---|---|
| vits-piper-fr_FR-siwis | Female | 300MB | Download |
German
| Model | Voice | Size | Download |
|---|---|---|---|
| vits-piper-de_DE-thorsten-medium | Male | 300MB | Download |
Kokoro Multi-Language (103+ Speakers)
| Model | Languages | Speakers | Size |
|---|---|---|---|
| kokoro-multi-lang-v1_0 | Multi | 103+ | Large |
Recommended: vits-melo-tts-zh_en for mixed Chinese-English.
🎧 Try Before You Download
Listen to samples at the Sherpa-ONNX Text-to-Speech Space on HuggingFace.
📚 More Languages and Models
Sherpa-ONNX supports 40+ languages and 100+ pre-trained models:
- Full Model List: https://k2-fsa.github.io/sherpa/onnx/tts/all/
- VITS Models: https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html
- GitHub Releases: https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models
Supported languages include: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese, and more.
📦 Installation
Quick Install (Recommended)
- Clone this repository
git clone https://github.com/Johnny-xuan/smart-voice-chat.git ~/smart-voice-chat
- Download Sherpa-ONNX runtime
# macOS
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.23/sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2 | tar xjf -
mkdir -p ~/.clawdbot/sherpa-asr/runtime
mv sherpa-onnx*/* ~/.clawdbot/sherpa-asr/runtime/
- Download models
# STT Model (Chinese)
mkdir -p ~/.clawdbot/sherpa-asr/models
cd ~/.clawdbot/sherpa-asr/models
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2024-03-09.tar.bz2 | tar xjf -
# TTS Model (Chinese-English)
mkdir -p ~/.clawdbot/tools/sherpa-onnx-tts/models
cd ~/.clawdbot/tools/sherpa-onnx-tts/models
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2 | tar xjf -
- Install to Clawdbot skills
# Find Clawdbot skills directory
CLAWDBOT_SKILLS=$(npm root -g)/clawdbot@*/node_modules/clawdbot/skills
# Copy skill
cp -r ~/smart-voice-chat "$CLAWDBOT_SKILLS/"
# Verify
clawdbot skills list | grep smart-voice
Expected output: │ ✓ ready │ 🗣️ smart-voice- │ ...
- Restart Clawdbot
pkill -9 clawdbot
clawdbot-gateway &
⚙️ Configuration
Environment Variables (Optional)
Add to ~/.clawdbot/clawdbot.json:
{
  "skills": {
    "entries": {
      "smart-voice-chat": {
        "env": {
          "SMART_VOICE_CHAT_STT_MODEL": "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09",
          "SMART_VOICE_CHAT_TTS_MODEL": "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en",
          "SMART_VOICE_CHAT_OUTPUT_DIR": "/tmp/smart-voice-chat"
        }
      }
    }
  }
}
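A sketch of how a script might resolve these settings, with the environment variables taking precedence over the config.yaml defaults (the precedence order is an assumption, not documented Clawdbot behaviour):

```python
import os

# Environment variables win; the fallbacks mirror config.yaml.
stt_model = os.path.expanduser(os.environ.get(
    "SMART_VOICE_CHAT_STT_MODEL",
    "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09"))
tts_model = os.path.expanduser(os.environ.get(
    "SMART_VOICE_CHAT_TTS_MODEL",
    "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en"))
output_dir = os.environ.get("SMART_VOICE_CHAT_OUTPUT_DIR", "/tmp/smart-voice-chat")
```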
SKILL.md Format
The SKILL.md frontmatter must be valid YAML, with the metadata field written as inline JSON:
---
name: smart-voice-chat
description: "Voice conversation with auto-detection (voice-to-voice, text-to-text)"
metadata: {"clawdbot":{"emoji":"🗣️","os":["darwin","linux"],"requires":{"anyBins":["ffmpeg"]}}}
---
Important:
- Always wrap description in quotes
- Avoid special characters (use "to" instead of "→")
- requires only supports: bins, anyBins, env, config
- Does NOT support: python field (ignored by Clawdbot)
💡 Usage
Default Mirror Mode
You: [Voice message] "What's the weather like today?"
AI: [Voice + Text] "It's sunny today, 25°C"
You: "What's the weather like today?"
AI: "It's sunny today, 25°C" [Text only]
Override with Keywords
You: "Reply with voice: Will it rain tomorrow?"
AI: [Voice only] "It might rain lightly tomorrow"
You: "Reply with text: What time is it now?"
AI: [Text only] "It's 4 PM now"
🔧 Tech Stack
| Component | Technology |
|---|---|
| STT | Sherpa-ONNX Paraformer (zh-en) |
| TTS | Sherpa-ONNX VITS-Melo (zh-en) |
| Audio | FFmpeg (WAV → OGG/OPUS) |
| Language | Python 3 + Bash |
🐛 Troubleshooting
Skill not showing in clawdbot skills list
- Check SKILL.md syntax:
head -5 ~/.clawdbot/skills/smart-voice-chat/SKILL.md
- Verify FFmpeg is installed:
which ffmpeg
- Check logs:
tail -50 ~/.clawdbot/logs/gateway.err.log
OGG conversion fails
Install FFmpeg:
brew install ffmpeg # macOS
sudo apt install ffmpeg # Ubuntu/Debian
📄 License
MIT License - see LICENSE
🙏 Acknowledgments
- Clawdbot - AI Agent Framework
- Sherpa-ONNX - Offline speech processing
- Paraformer - Alibaba's ASR model
- VITS-Melo - MyShell's TTS model
Author: Johnny
GitHub: smart-voice-chat
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.