smart-voice-chat

by @Johnny-xuan in Development

# Install this skill:

npx skills add Johnny-xuan/smart-voice-chat

Or install a specific skill: npx add-skill https://github.com/Johnny-xuan/smart-voice-chat

# Description

Voice conversation: transcribe voice input, reply in same format (voice-to-voice, text-to-text) by default. User can override with voice or text commands.

# SKILL.md

name: smart-voice-chat
description: "Voice conversation: transcribe voice input, reply in same format (voice-to-voice, text-to-text) by default. User can override with voice or text commands."
metadata: {"clawdbot":{"emoji":"🗣️","os":["darwin","linux"],"requires":{"anyBins":["ffmpeg"],"python":["sherpa-onnx","yaml"]}}}

SmartVoice Chat 🗣️

Intelligent voice conversation system with automatic format detection and mirrored response mode.

🎯 How It Works

Core Principle: Reply in the same format as the input (voice → voice, text → text)

Voice Input → STT → AI Process → TTS → Voice Reply
Text Input  → AI Process → Text Reply

📋 Workflow for AI Agents

Step 1: Detect Input Format

When receiving a message from the user:

Check if message contains:
- Voice attachment (audio/* mime type)
- Audio file path (.wav, .ogg, .mp3, .m4a)

If voice detected:

~/.clawdbot/skills/smart-voice-chat/bin/stt.py <audio_file>

If text:
- Use the text directly
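
The detection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual code: `detect_input_format` and its parameters are hypothetical names, and how a given channel exposes the MIME type or file path is an assumption.

```python
from pathlib import Path
from typing import Optional

# Extensions named in the detection step above.
AUDIO_EXTENSIONS = {".wav", ".ogg", ".mp3", ".m4a"}


def detect_input_format(mime_type: Optional[str] = None,
                        file_path: Optional[str] = None) -> str:
    """Return "voice" for an audio attachment or audio file path, else "text"."""
    # Hypothetical helper: the channel is assumed to supply one or both values.
    if mime_type and mime_type.lower().startswith("audio/"):
        return "voice"
    if file_path and Path(file_path).suffix.lower() in AUDIO_EXTENSIONS:
        return "voice"
    return "text"
```

Checking the MIME type first means an attachment without a recognizable extension is still treated as voice.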

Step 2: Parse Intent

Check the transcribed text for keywords:
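- "用语音回答" ("answer by voice"), "读出来" ("read it out"), "说一下" ("say it") → Voice-only output
- "用文字回答" ("answer in text"), "不用读" ("no need to read it aloud"), "只显示" ("display only") → Text-only output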
- Default → Use same format as input (voice → voice, text → text)
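
A possible shape for this step, assuming simple substring matching: `parse_intent` is a hypothetical helper, and the set of separator characters stripped after a keyword is an assumption. The text-only keywords are checked first so that a negated phrase such as "不用读出来" is not mistaken for the voice keyword "读出来".

```python
from typing import Tuple

# Keyword lists taken from the step above.
TEXT_KEYWORDS = ("用文字回答", "不用读", "只显示")
VOICE_KEYWORDS = ("用语音回答", "读出来", "说一下")
# Assumed separators that may surround a keyword, e.g. "用文字回答：今天几点了".
_SEPARATORS = " :：,，"


def parse_intent(text: str, input_format: str) -> Tuple[str, str]:
    """Return (output_mode, cleaned_text).

    output_mode is "voice" or "text"; without a keyword it mirrors input_format.
    """
    for mode, keywords in (("text", TEXT_KEYWORDS), ("voice", VOICE_KEYWORDS)):
        for keyword in keywords:
            if keyword in text:
                # Remove the keyword so the AI only sees the actual question.
                cleaned = text.replace(keyword, "", 1).strip(_SEPARATORS)
                return mode, cleaned
    return input_format, text
```

The cleaned text, not the raw transcript, is what should be passed on to Step 3.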

Step 3: Process with AI

Use the transcribed/cleaned text as the user's actual message for AI processing.

Step 4: Generate Response

For voice output or dual mode:

~/.clawdbot/skills/smart-voice-chat/bin/tts.py \
  "<AI_RESPONSE_TEXT>" \
  /tmp/smart-voice-chat/response_<timestamp>

Note: TTS will automatically output .ogg format (Telegram voice message compatible)

Then attach the audio file to the reply:
- For Telegram: Use audioAsVoice: true with mediaUrl (.ogg file)
- For iMessage: Attach the .ogg file (Telegram compatible format)
- For other channels: Attach based on channel capabilities

For text-only mode:
- Send text only, no audio attachment
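
The TTS call could be wrapped roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes that tts.py appends ".ogg" to the base path it receives, which the note above suggests but does not spell out.

```python
import subprocess
import time
import uuid
from pathlib import Path
from typing import List

TTS_SCRIPT = Path.home() / ".clawdbot/skills/smart-voice-chat/bin/tts.py"
OUTPUT_DIR = Path("/tmp/smart-voice-chat")


def unique_output_base(output_dir: Path = OUTPUT_DIR) -> Path:
    """Build a collision-resistant output path without an extension."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    return output_dir / f"response_{stamp}_{uuid.uuid4().hex[:8]}"


def build_tts_command(text: str, output_base: Path) -> List[str]:
    """Argument list matching the tts.py call shown above."""
    return [str(TTS_SCRIPT), text, str(output_base)]


def synthesize(text: str) -> Path:
    """Run tts.py and return the path where the .ogg file is expected."""
    output_base = unique_output_base()
    output_base.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_tts_command(text, output_base), check=True)
    # Assumption: tts.py appends ".ogg" to the base path it is given.
    return Path(str(output_base) + ".ogg")
```

Passing the arguments as a list avoids shell quoting problems when the AI response contains quotes or newlines.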

💡 Usage Examples

Example 1: Voice Input → Voice Reply (Default)

You: [Send voice message] "今天天气怎么样" ("What's the weather like today?")

AI:
  1. Detects voice attachment
  2. Runs STT → "今天天气怎么样"
  3. Processes with AI → "今天晴天，气温25度" ("It's sunny today, 25°C")
  4. Runs TTS → generates .ogg file
  5. Sends: "今天晴天，气温25度" + voice attachment

Example 2: Text Input → Text Reply (Default)

You: "今天天气怎么样" ("What's the weather like today?")

AI:
  1. Detects text input
  2. Processes with AI → "今天晴天，气温25度" ("It's sunny today, 25°C")
  3. Sends: "今天晴天，气温25度" (text only)

Example 3: Voice Input → Text Reply (Special Request)

You: [Send voice] "用文字回答：今天几点了" ("Answer in text: what time is it?")

AI:
  1. Detects voice attachment
  2. Runs STT → "用文字回答：今天几点了"
  3. Parses intent → Text-only mode
  4. Cleans text → "今天几点了" ("What time is it?")
  5. Processes with AI → "现在是下午4点" ("It's 4 PM now")
  6. Sends: "现在是下午4点" (text only, no voice)

Example 4: Text Input → Voice Reply (Special Request)

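You: "用语音回答：明天会下雨吗" ("Answer by voice: will it rain tomorrow?")

AI:
  1. Detects text input
  2. Parses intent → Voice-only mode
  3. Cleans text → "明天会下雨吗" ("Will it rain tomorrow?")
  4. Processes with AI → "明天可能有小雨" ("It might rain lightly tomorrow")
  5. Runs TTS → generates .ogg file
  6. Sends: voice attachment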

🔧 Configuration

Edit ~/.clawdbot/skills/smart-voice-chat/config/config.yaml:

# Default behavior
voice:
  input_mode: auto          # Auto-detect input type
  output_mode: mirror       # mirror = same format as input
  auto_play: false          # Let Clawdbot handle playback

# STT settings
stt:
  model_path: ~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09
  language: zh-en

# TTS settings
tts:
  model_path: ~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en
  output_dir: /tmp/smart-voice-chat
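
One way to consume this file is to overlay the user's values on the defaults shown above, so that a partial config.yaml still yields a complete configuration. The sketch below assumes the YAML has already been parsed into a dictionary (for example with `yaml.safe_load`); `merge_config` is a hypothetical helper, and whether the skill's scripts actually fall back to these defaults is an assumption.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Defaults copied from the config.yaml shown above.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "voice": {"input_mode": "auto", "output_mode": "mirror", "auto_play": False},
    "stt": {
        "model_path": "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09",
        "language": "zh-en",
    },
    "tts": {
        "model_path": "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en",
        "output_dir": "/tmp/smart-voice-chat",
    },
}


def merge_config(user: Optional[Dict[str, Dict[str, Any]]]) -> Dict[str, Dict[str, Any]]:
    """Overlay user settings on DEFAULTS, one section at a time."""
    merged = deepcopy(DEFAULTS)  # never mutate the shared defaults
    for section, values in (user or {}).items():
        merged.setdefault(section, {}).update(values or {})
    return merged
```

Merging per section keeps keys the user did not mention, which a plain top-level `dict.update` would discard.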

🎨 Reply Format

For Telegram

{
  "text": "AI response text",
  "mediaUrl": "/tmp/smart-voice-chat/response_xxx.ogg",
  "audioAsVoice": true
}

For iMessage

{
  "text": "AI response text",
  "attachments": [
    {
      "original_path": "/tmp/smart-voice-chat/response_xxx.ogg"
    }
  ]
}
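
Both payload shapes can be produced by one small builder. This is an illustrative sketch: `build_reply` and the channel identifiers "telegram" and "imessage" are hypothetical names, while the field names follow the two JSON examples above.

```python
from typing import Any, Dict, Optional


def build_reply(channel: str, text: str,
                audio_path: Optional[str] = None) -> Dict[str, Any]:
    """Build a reply payload in the shapes shown above."""
    reply: Dict[str, Any] = {"text": text}
    if audio_path is None:
        return reply  # text-only mode: no audio attachment
    if channel == "telegram":
        reply["mediaUrl"] = audio_path
        reply["audioAsVoice"] = True
    elif channel == "imessage":
        reply["attachments"] = [{"original_path": audio_path}]
    else:
        # Other channels are not described here, so fail loudly instead of guessing.
        raise ValueError(f"no payload shape known for channel {channel!r}")
    return reply
```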

⚠️ Important Notes

- Default mode is "mirror": Reply in same format as input
- Always transcribe voice first: Don't process raw audio files
- Clean intent keywords: Remove "用语音回答" etc. before AI processing
- Generate unique filenames: Use timestamp or random ID for TTS output
- Handle STT failures: If transcription fails, ask user to repeat
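
The last note, handling STT failures, might look like this in practice. The sketch assumes that stt.py prints the transcript to stdout and exits non-zero on failure; neither is confirmed here. `transcribe_or_prompt` and the wording of the retry prompt are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional, Tuple

STT_SCRIPT = Path.home() / ".clawdbot/skills/smart-voice-chat/bin/stt.py"
RETRY_PROMPT = "Sorry, I could not understand the audio. Could you repeat that?"


def run_stt(audio_file: str) -> str:
    """Call stt.py; assumes the transcript is printed to stdout."""
    result = subprocess.run([str(STT_SCRIPT), audio_file],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def transcribe_or_prompt(
    audio_file: str,
    transcriber: Callable[[str], str] = run_stt,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (transcript, None) on success or (None, retry_prompt) on failure."""
    try:
        transcript = transcriber(audio_file)
    except (subprocess.CalledProcessError, OSError):
        return None, RETRY_PROMPT
    if not transcript:
        # An empty transcript is treated like a failed transcription.
        return None, RETRY_PROMPT
    return transcript, None
```

Injecting the transcriber keeps the fallback logic testable without a model or audio file on disk.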

📊 Supported Audio Formats

Input: .wav, .mp3, .m4a, .ogg, .opus, .flac
Output: .ogg (OPUS encoded, Telegram voice message compatible)
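
For reference, a WAV file can be converted to OGG/Opus with FFmpeg along these lines. The flags below are standard FFmpeg options, but the exact command and bitrate used by the skill's own scripts are not shown here, so treat both as assumptions; `build_ogg_command` and `convert_to_ogg` are hypothetical helpers.

```python
import subprocess
from pathlib import Path
from typing import List


def build_ogg_command(src: Path, dst: Path) -> List[str]:
    """FFmpeg arguments for encoding src as Opus audio in an OGG container."""
    return [
        "ffmpeg",
        "-y",              # overwrite dst if it already exists
        "-i", str(src),    # input file (any format FFmpeg can decode)
        "-c:a", "libopus", # Opus encoder
        "-b:a", "32k",     # assumed bitrate, adequate for speech
        str(dst),
    ]


def convert_to_ogg(src: Path) -> Path:
    """Convert src next to itself and return the new .ogg path."""
    dst = src.with_suffix(".ogg")
    subprocess.run(build_ogg_command(src, dst), check=True)
    return dst
```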

TL;DR: Auto-detect input format → Process with AI → Reply in same format (unless user requests otherwise)

# README.md

SmartVoice Chat 🗣️

Intelligent Voice Conversation Skill for Clawdbot

Offline voice-to-voice interaction powered by Sherpa-ONNX. Automatically detects voice/text input and replies in the same format.

✨ Features

🎯 Auto Detection - Automatically detects voice vs text input
🗣️ Voice-to-Voice - Replies in same format (voice→voice, text→text)
🌏 Chinese-English - Native mixed language support
🔒 Fully Offline - No cloud, privacy-preserving
📱 Telegram Ready - Outputs OGG format for voice messages

📋 System Requirements

Operating System

macOS: 11.0+ (Big Sur or later)
Linux: Ubuntu 20.04+, Debian 11+, or equivalent
Architecture: x86_64 or ARM64 (Apple Silicon supported)

Dependencies

| Component | Version | Installation |
|-----------|---------|--------------|
| Clawdbot | Latest | `npm install -g clawdbot@latest` |
| Node.js | 22+ | `brew install node` (macOS) or `nvm install 22` |
| Python | 3.8+ | `brew install python3` (macOS) or `apt install python3` |
| FFmpeg | 4.0+ | `brew install ffmpeg` (macOS) or `apt install ffmpeg` |
| pip3 | Latest | Included with Python 3 |

Hardware Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 4 GB | 8 GB+ |
| Storage | 2 GB free | 4 GB+ free (for models) |
| CPU | Any modern CPU | Apple Silicon M1/M2/M3 or Intel Core i5+ |

Network Requirements

Required: For downloading models and dependencies (initial setup only)
Runtime: Fully offline after installation

🔧 Sherpa-ONNX Models

STT (Speech-to-Text) Options

| Model | Language | Size | Speed | Accuracy |
|-------|----------|------|-------|----------|
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese | 950MB | Fast | High |
| sherpa-onnx-streaming-zh-en-2024-03-12 | Chinese-English | 490MB | Very Fast | Medium |

Recommended: sherpa-onnx-paraformer-zh-2024-03-09 for best accuracy.

TTS (Text-to-Speech) Options

Chinese-English (Recommended)

| Model | Voice | Size | Download |
|-------|-------|------|----------|
| vits-melo-tts-zh_en | Female | 163MB | Download |
| vits-piper-zh_CN-huayan | Female | 300MB | Download |

English Voices

| Model | Accent | Voice | Size | Download |
|-------|--------|-------|------|----------|
| vits-piper-en_US-lessac-high | American | Male | 500MB | Download |
| vits-piper-en_US-glados | American | Female | 300MB | Download |
| vits-piper-en_GB-semaine | British | Female | 300MB | Download |
| vits-piper-en_GB-lessac | British | Male | 500MB | Download |

Japanese

| Model | Voice | Size | Download |
|-------|-------|------|----------|
| vits-vctk | Multi-speaker | 500MB+ | Download |

Spanish

| Model | Voice | Size | Download |
|-------|-------|------|----------|
| vits-piper-es_ES-vox | Female | 300MB | Download |

French

| Model | Voice | Size | Download |
|-------|-------|------|----------|
| vits-piper-fr_FR-siwis | Female | 300MB | Download |

German

| Model | Voice | Size | Download |
|-------|-------|------|----------|
| vits-piper-de_DE-thorsten-medium | Male | 300MB | Download |

Kokoro Multi-Language (103+ Speakers)

| Model | Languages | Speakers | Size |
|-------|-----------|----------|------|
| kokoro-multi-lang-v1_0 | Multi | 103+ | Large |

Recommended: vits-melo-tts-zh_en for mixed Chinese-English.

🎧 Try Before You Download

Listen to samples at the Sherpa-ONNX Text-to-Speech Space on HuggingFace.

📚 More Languages and Models

Sherpa-ONNX supports 40+ languages and 100+ pre-trained models:

Full Model List: https://k2-fsa.github.io/sherpa/onnx/tts/all/
VITS Models: https://k2-fsa.github.io/sherpa/onnx/tts/pretrained_models/vits.html
GitHub Releases: https://github.com/k2-fsa/sherpa-onnx/releases/tag/tts-models

Supported languages include: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese, and more.

📦 Installation

Quick Install (Recommended)

Clone this repository

git clone https://github.com/Johnny-xuan/smart-voice-chat.git ~/smart-voice-chat

Download Sherpa-ONNX runtime

# macOS
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.23/sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2 | tar xjf -
mkdir -p ~/.clawdbot/sherpa-asr/runtime
mv sherpa-onnx*/* ~/.clawdbot/sherpa-asr/runtime/

Download models

# STT Model (Chinese)
mkdir -p ~/.clawdbot/sherpa-asr/models
cd ~/.clawdbot/sherpa-asr/models
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2024-03-09.tar.bz2 | tar xjf -

# TTS Model (Chinese-English)
mkdir -p ~/.clawdbot/tools/sherpa-onnx-tts/models
cd ~/.clawdbot/tools/sherpa-onnx-tts/models
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2 | tar xjf -

Install to Clawdbot skills

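# Find Clawdbot skills directory
# (a glob does not expand inside a plain assignment or a quoted variable, so resolve it with ls -d)
CLAWDBOT_SKILLS=$(ls -d "$(npm root -g)"/clawdbot@*/node_modules/clawdbot/skills | head -n 1)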

# Copy skill
cp -r ~/smart-voice-chat "$CLAWDBOT_SKILLS/"

# Verify
clawdbot skills list | grep smart-voice

Expected output: │ ✓ ready │ 🗣️ smart-voice- │ ...

Restart Clawdbot

pkill -9 clawdbot
clawdbot-gateway &

⚙️ Configuration

Environment Variables (Optional)

Add to ~/.clawdbot/clawdbot.json:

{
  "skills": {
    "entries": {
      "smart-voice-chat": {
        "env": {
          "SMART_VOICE_CHAT_STT_MODEL": "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09",
          "SMART_VOICE_CHAT_TTS_MODEL": "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en",
          "SMART_VOICE_CHAT_OUTPUT_DIR": "/tmp/smart-voice-chat"
        }
      }
    }
  }
}
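
A script that consumes these variables needs a fallback for unset values, and it has to expand "~" itself, because a tilde inside a JSON string is never expanded by a shell. The sketch below shows one way to do both; `read_setting` is a hypothetical helper, and the assumption that the skill's scripts read these variables in this manner is not confirmed here.

```python
import os
from typing import Mapping, Optional

# Fallbacks matching the values shown in the JSON above.
ENV_DEFAULTS = {
    "SMART_VOICE_CHAT_STT_MODEL":
        "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09",
    "SMART_VOICE_CHAT_TTS_MODEL":
        "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en",
    "SMART_VOICE_CHAT_OUTPUT_DIR": "/tmp/smart-voice-chat",
}


def read_setting(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the variable's value (or its default) with "~" expanded."""
    env = os.environ if env is None else env
    value = env.get(name) or ENV_DEFAULTS[name]
    return os.path.expanduser(value)
```

For example, `read_setting("SMART_VOICE_CHAT_TTS_MODEL")` returns an absolute path under the current user's home directory when the variable is unset.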

SKILL.md Format

The metadata value in the SKILL.md frontmatter must be valid JSON:

---
name: smart-voice-chat
description: "Voice conversation with auto-detection (voice-to-voice, text-to-text)"
metadata: {"clawdbot":{"emoji":"🗣️","os":["darwin","linux"],"requires":{"anyBins":["ffmpeg"]}}}
---

Important:
- Always wrap description in quotes
- Avoid special characters (write "to" instead of "→")
- requires only supports: bins, anyBins, env, config
- Does NOT support: python field (ignored by Clawdbot)
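
These rules can be checked mechanically before installing the skill. The following is a rough sketch rather than Clawdbot's own validation: `check_frontmatter` is a hypothetical helper, and it only handles the simple one-line `key: value` frontmatter shown above, not full YAML.

```python
import json
from typing import Dict, List

SUPPORTED_REQUIRES = {"bins", "anyBins", "env", "config"}


def check_frontmatter(skill_md: str) -> List[str]:
    """Return a list of problems found in the frontmatter (empty if none)."""
    lines = skill_md.splitlines()
    if not lines or lines[0].strip() != "---":
        return ["frontmatter must start with ---"]
    try:
        end = lines.index("---", 1)
    except ValueError:
        return ["frontmatter is not closed with ---"]

    # Split each "key: value" line at the first colon only.
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    problems: List[str] = []
    description = fields.get("description", "")
    if not (len(description) >= 2 and description[0] == '"' and description[-1] == '"'):
        problems.append("description should be wrapped in double quotes")
    try:
        metadata = json.loads(fields.get("metadata", ""))
    except json.JSONDecodeError:
        return problems + ["metadata is not valid JSON"]
    if not isinstance(metadata, dict):
        return problems + ["metadata must be a JSON object"]
    requires = metadata.get("clawdbot", {}).get("requires", {})
    for key in requires:
        if key not in SUPPORTED_REQUIRES:
            problems.append(f"requires.{key} is not supported")
    return problems
```

Run against a frontmatter whose requires block contains a python field, the function reports "requires.python is not supported".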

💡 Usage

Default Mirror Mode

You: [Voice message] "What's the weather like today?"
AI:  [Voice + Text] "It's sunny today, 25°C"

You: "What's the weather like today?"
AI:  "It's sunny today, 25°C" [Text only]

Override with Keywords

You: "Reply with voice: Will it rain tomorrow?"
AI:  [Voice only] "It might rain lightly tomorrow"

You: "Reply with text: What time is it now?"
AI:  [Text only] "It's 4 PM now"

🔧 Tech Stack

| Component | Technology |
|-----------|------------|
| STT | Sherpa-ONNX Paraformer (zh-en) |
| TTS | Sherpa-ONNX VITS-Melo (zh-en) |
| Audio | FFmpeg (WAV → OGG/OPUS) |
| Language | Python 3 + Bash |

🐛 Troubleshooting

Skill not showing in `clawdbot skills list`

Check SKILL.md syntax:

head -5 ~/.clawdbot/skills/smart-voice-chat/SKILL.md

Verify FFmpeg is installed:

which ffmpeg

Check logs:

tail -50 ~/.clawdbot/logs/gateway.err.log

OGG conversion fails

Install FFmpeg:

brew install ffmpeg  # macOS
sudo apt install ffmpeg  # Ubuntu/Debian
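
The same FFmpeg check can be done from Python before any audio is processed, which gives a clearer error than a failed conversion. A minimal sketch, where `missing_binaries` is a hypothetical helper built on the standard library's `shutil.which`:

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_binaries(
    names: Iterable[str] = ("ffmpeg",),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required executables that are not found on PATH."""
    return [name for name in names if which(name) is None]
```

An empty list means every required binary was found; otherwise the list names what still needs to be installed.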

📄 License

MIT License - see LICENSE

🙏 Acknowledgments

Clawdbot - AI Agent Framework
Sherpa-ONNX - Offline speech processing
Paraformer - Alibaba's ASR model
VITS-Melo - MyShell's TTS model

Author: Johnny
GitHub: smart-voice-chat

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents:

⚡ Amp 🚀 Antigravity 🤖 Claude Code 🦀 Clawdbot 📝 Codex ▶️ Cursor 🤖 Droid 💎 Gemini CLI 🐙 GitHub Copilot 🪿 Goose 📊 Kilo Code 🔧 Kiro CLI 💻 OpenCode 🦘 Roo Code 🌲 Trae 🏄 Windsurf

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.
