Johnny-xuan

smart-voice-chat

# Install this skill:
npx skills add Johnny-xuan/smart-voice-chat

Or install specific skill: npx add-skill https://github.com/Johnny-xuan/smart-voice-chat

# Description

Voice conversation: transcribe voice input, reply in same format (voice-to-voice, text-to-text) by default. User can override with voice or text commands.

# SKILL.md


---
name: smart-voice-chat
description: "Voice conversation: transcribe voice input, reply in same format (voice-to-voice, text-to-text) by default. User can override with voice or text commands."
metadata: {"clawdbot":{"emoji":"🗣️","os":["darwin","linux"],"requires":{"anyBins":["ffmpeg"],"python":["sherpa-onnx","yaml"]}}}
---


SmartVoice Chat 🗣️

Intelligent voice conversation system with automatic format detection and mirrored response mode.

🎯 How It Works

Core Principle: Reply in the same format as the input (voice → voice, text → text)

Voice Input → STT → AI Process → TTS → Voice Reply
Text Input  → AI Process → Text Reply

📋 Workflow for AI Agents

Step 1: Detect Input Format

When receiving a message from the user:

Check if message contains:
- Voice attachment (audio/* mime type)
- Audio file path (.wav, .ogg, .mp3, .m4a)

If voice detected:

~/.clawdbot/skills/smart-voice-chat/bin/stt.py <audio_file>

If text:
- Use the text directly
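
The detection rule in Step 1 can be sketched in a few lines of Python. This is only an illustration: the function name `is_voice_input` and its parameters are hypothetical, and how the MIME type or path is obtained from a message depends on the channel.

```python
from pathlib import Path
from typing import Optional

# Extensions listed in Step 1; extend as needed.
AUDIO_EXTENSIONS = {".wav", ".ogg", ".mp3", ".m4a"}


def is_voice_input(mime_type: Optional[str] = None,
                   file_path: Optional[str] = None) -> bool:
    """Return True when a message carries audio, judged by MIME type or extension."""
    if mime_type and mime_type.lower().startswith("audio/"):
        return True
    if file_path and Path(file_path).suffix.lower() in AUDIO_EXTENSIONS:
        return True
    return False
```

If this returns True, the audio file goes through stt.py first; otherwise the message text is used as-is.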

Step 2: Parse Intent

Check the transcribed text for keywords:
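- "用语音回答" ("reply with voice"), "读出来" ("read it out"), "说一下" ("say it") → Voice-only output
- "用文字回答" ("reply with text"), "不用读" ("no need to read it"), "只显示" ("display only") → Text-only output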
- Default → Use same format as input (voice → voice, text → text)
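
A minimal sketch of the keyword check, assuming simple substring matching. The function name `parse_intent` and the mode labels `voice_only` and `text_only` are hypothetical; only the keywords and the `mirror` default come from this document. If both kinds of keyword appear, this sketch lets the text-only request win.

```python
from typing import Tuple

VOICE_KEYWORDS = ("用语音回答", "读出来", "说一下")
TEXT_KEYWORDS = ("用文字回答", "不用读", "只显示")


def parse_intent(text: str) -> Tuple[str, str]:
    """Return (mode, cleaned_text); mode is 'voice_only', 'text_only' or 'mirror'."""
    mode = "mirror"  # default: reply in the same format as the input
    for keywords, forced_mode in ((VOICE_KEYWORDS, "voice_only"),
                                  (TEXT_KEYWORDS, "text_only")):
        for keyword in keywords:
            if keyword in text:
                mode = forced_mode
                text = text.replace(keyword, "")
    # Drop separators left behind once the keyword is removed (ASCII and full-width).
    cleaned = text.strip(" \t\n::,,")
    return mode, cleaned
```

The cleaned text, not the raw transcript, is what should be passed on in Step 3.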

Step 3: Process with AI

Use the transcribed/cleaned text as the user's actual message for AI processing.

Step 4: Generate Response

For voice output, or dual mode (voice plus text):

~/.clawdbot/skills/smart-voice-chat/bin/tts.py \
  "<AI_RESPONSE_TEXT>" \
  /tmp/smart-voice-chat/response_<timestamp>

Note: TTS will automatically output .ogg format (Telegram voice message compatible)

Then attach the audio file to the reply:
- For Telegram: Use audioAsVoice: true with mediaUrl (.ogg file)
- For iMessage: Attach the .ogg file (Telegram compatible format)
- For other channels: Attach based on channel capabilities

For text-only mode:
- Send text only, no audio attachment
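
The TTS call in Step 4 could be wrapped as follows. This is a sketch under two assumptions that this document implies but does not state outright: that tts.py takes the text and an extension-less output base as its two arguments, and that it writes `<base>.ogg`. The helper names are hypothetical.

```python
import subprocess
import time
import uuid
from pathlib import Path

TTS_SCRIPT = Path.home() / ".clawdbot/skills/smart-voice-chat/bin/tts.py"
OUTPUT_DIR = Path("/tmp/smart-voice-chat")


def new_output_base(output_dir: Path = OUTPUT_DIR) -> Path:
    """Build a unique, extension-less base path: response_<timestamp>_<random id>."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    return output_dir / f"response_{stamp}_{uuid.uuid4().hex[:8]}"


def synthesize(text: str, output_dir: Path = OUTPUT_DIR) -> Path:
    """Run tts.py and return the path of the audio file it is expected to write."""
    output_dir.mkdir(parents=True, exist_ok=True)
    base = new_output_base(output_dir)
    subprocess.run([str(TTS_SCRIPT), text, str(base)], check=True)
    # Assumption: tts.py appends ".ogg" to the base path it is given.
    return base.with_suffix(".ogg")
```

The random suffix keeps two replies generated within the same second from overwriting each other.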

💡 Usage Examples

Example 1: Voice Input → Voice Reply (Default)

You: [Send voice message] "今天天气怎么样" ("What's the weather like today?")

AI:
  1. Detects voice attachment
  2. Runs STT → "今天天气怎么样"
  3. Processes AI → "今天晴天,气温25度" ("It's sunny today, 25°C")
  4. Runs TTS → generates .ogg file
  5. Sends: "今天晴天,气温25度" + voice attachment

Example 2: Text Input → Text Reply (Default)

You: "今天天气怎么样" ("What's the weather like today?")

AI:
  1. Detects text input
  2. Processes AI → "今天晴天,气温25度" ("It's sunny today, 25°C")
  3. Sends: "今天晴天,气温25度" (text only)

Example 3: Voice Input → Text Reply (Special Request)

You: [Send voice] "用文字回答:今天几点了" ("Reply with text: what time is it?")

AI:
  1. Detects voice attachment
  2. Runs STT → "用文字回答:今天几点了"
  3. Parses intent → Text-only mode
  4. Cleans text → "今天几点了"
  5. Processes AI → "现在是下午4点" ("It's 4 PM now")
  6. Sends: "现在是下午4点" (text only, no voice)

Example 4: Text Input → Voice Reply (Special Request)

You: "用语音回答:明天会下雨吗" ("Reply with voice: will it rain tomorrow?")

AI:
  1. Detects text input
  2. Parses intent → Voice-only mode
  3. Cleans text → "明天会下雨吗"
  4. Processes AI → "明天可能有小雨" ("It might rain lightly tomorrow")
  5. Runs TTS → generates .ogg file
  6. Sends: voice attachment
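
Taken together, the four examples suggest a small decision table for what gets sent. The sketch below encodes it; the function name and the `voice_only` / `text_only` labels are hypothetical, while `mirror` matches the `output_mode` value used in the configuration.

```python
from typing import Tuple


def reply_parts(input_is_voice: bool, mode: str = "mirror") -> Tuple[bool, bool]:
    """Return (send_text, send_voice) for a reply."""
    if mode == "voice_only":
        return (False, True)   # Example 4: voice attachment only
    if mode == "text_only":
        return (True, False)   # Example 3: text only, no voice
    if mode == "mirror":
        # Examples 1 and 2: text is always sent; voice is added for voice input.
        return (True, input_is_voice)
    raise ValueError(f"unknown output mode: {mode}")
```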

🔧 Configuration

Edit ~/.clawdbot/skills/smart-voice-chat/config/config.yaml:

# Default behavior
voice:
  input_mode: auto          # Auto-detect input type
  output_mode: mirror       # mirror = same format as input
  auto_play: false          # Let Clawdbot handle playback

# STT settings
stt:
  model_path: ~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09
  language: zh-en

# TTS settings
tts:
  model_path: ~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en
  output_dir: /tmp/smart-voice-chat

🎨 Reply Format

For Telegram

{
  "text": "AI response text",
  "mediaUrl": "/tmp/smart-voice-chat/response_xxx.ogg",
  "audioAsVoice": true
}

For iMessage

{
  "text": "AI response text",
  "attachments": [
    {
      "original_path": "/tmp/smart-voice-chat/response_xxx.ogg"
    }
  ]
}
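
For illustration, both payload shapes can be produced by one helper. The field names are taken from the two examples above; the function name and the channel identifiers `"telegram"` and `"imessage"` are assumptions made for this sketch.

```python
from typing import Any, Dict, Optional


def build_reply(channel: str, text: str,
                audio_path: Optional[str] = None) -> Dict[str, Any]:
    """Build a reply payload; omit audio_path for a text-only reply."""
    reply: Dict[str, Any] = {"text": text}
    if audio_path is None:
        return reply
    if channel == "telegram":
        reply["mediaUrl"] = audio_path
        reply["audioAsVoice"] = True
    elif channel == "imessage":
        reply["attachments"] = [{"original_path": audio_path}]
    else:
        # Other channels attach audio according to their own capabilities.
        raise ValueError(f"no audio payload shape known for channel: {channel}")
    return reply
```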

⚠️ Important Notes

  1. Default mode is "mirror": Reply in same format as input
  2. Always transcribe voice first: Don't process raw audio files
  3. Clean intent keywords: Remove "用语音回答" ("reply with voice") etc. before AI processing
  4. Generate unique filenames: Use timestamp or random ID for TTS output
  5. Handle STT failures: If transcription fails, ask user to repeat
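
Note 5 could be handled with a thin wrapper around the STT script. This sketch assumes that stt.py prints the transcript to standard output and exits non-zero on failure, neither of which is confirmed here. The `command` parameter and the retry message are hypothetical additions that make the wrapper easy to test.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence

STT_SCRIPT = Path.home() / ".clawdbot/skills/smart-voice-chat/bin/stt.py"
RETRY_PROMPT = "Sorry, I couldn't understand the audio. Could you say that again?"


def transcribe(audio_file: str,
               command: Optional[Sequence[str]] = None,
               timeout: float = 60.0) -> Optional[str]:
    """Run the STT command on audio_file; return the transcript or None on failure."""
    argv = list(command) if command else [str(STT_SCRIPT)]
    try:
        result = subprocess.run(argv + [audio_file], capture_output=True,
                                text=True, timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        # Missing script, non-zero exit status, or timeout.
        return None
    transcript = result.stdout.strip()
    return transcript or None  # an empty transcript also counts as a failure
```

When `transcribe` returns None, the agent can reply with `RETRY_PROMPT` instead of passing an empty message to the AI step.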

📊 Supported Audio Formats

Input: .wav, .mp3, .m4a, .ogg, .opus, .flac
Output: .ogg (OPUS encoded, Telegram voice message compatible)
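
The OGG/Opus output is the kind of file FFmpeg can produce from a WAV. The sketch below shows one plausible invocation, not necessarily the one tts.py uses; it assumes an FFmpeg build that includes the libopus encoder, and the 32k mono settings are illustrative defaults.

```python
import subprocess
from pathlib import Path
from typing import List


def ogg_opus_command(src: str, dst: str, bitrate: str = "32k") -> List[str]:
    """Build an ffmpeg command that encodes src as mono Opus in an OGG container."""
    return ["ffmpeg", "-y", "-i", src,
            "-c:a", "libopus", "-b:a", bitrate, "-ac", "1", dst]


def convert_to_ogg(src: str) -> str:
    """Convert an audio file to .ogg next to the original and return the new path."""
    dst = str(Path(src).with_suffix(".ogg"))
    if dst == src:
        raise ValueError("input is already an .ogg path; choose a different output name")
    subprocess.run(ogg_opus_command(src, dst), check=True, capture_output=True)
    return dst
```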


TL;DR: Auto-detect input format → Process with AI → Reply in same format (unless user requests otherwise)

# README.md

SmartVoice Chat 🗣️

License: MIT

Intelligent Voice Conversation Skill for Clawdbot

Offline voice-to-voice interaction powered by Sherpa-ONNX. Automatically detects voice/text input and replies in the same format.

✨ Features

  • 🎯 Auto Detection - Automatically detects voice vs text input
  • 🗣️ Voice-to-Voice - Replies in same format (voice→voice, text→text)
  • 🌏 Chinese-English - Native mixed language support
  • 🔒 Fully Offline - No cloud, privacy-preserving
  • 📱 Telegram Ready - Outputs OGG format for voice messages

📋 System Requirements

Operating System

  • macOS: 11.0+ (Big Sur or later)
  • Linux: Ubuntu 20.04+, Debian 11+, or equivalent
  • Architecture: x86_64 or ARM64 (Apple Silicon supported)

Dependencies

| Component | Version | Installation |
| --- | --- | --- |
| Clawdbot | Latest | npm install -g clawdbot@latest |
| Node.js | 22+ | brew install node (macOS) or nvm install 22 |
| Python | 3.8+ | brew install python3 (macOS) or apt install python3 |
| FFmpeg | 4.0+ | brew install ffmpeg (macOS) or apt install ffmpeg |
| pip3 | Latest | Included with Python 3 |

Hardware Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| RAM | 4 GB | 8 GB+ |
| Storage | 2 GB free | 4 GB+ free (for models) |
| CPU | Any modern CPU | Apple Silicon M1/M2/M3 or Intel Core i5+ |

Network Requirements

  • Required: For downloading models and dependencies (initial setup only)
  • Runtime: Fully offline after installation

🔧 Sherpa-ONNX Models

STT (Speech-to-Text) Options

| Model | Language | Size | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| sherpa-onnx-paraformer-zh-2024-03-09 | Chinese | 950MB | Fast | High |
| sherpa-onnx-streaming-zh-en-2024-03-12 | Chinese-English | 490MB | Very Fast | Medium |

Recommended: sherpa-onnx-paraformer-zh-2024-03-09 for best accuracy.

TTS (Text-to-Speech) Options

| Model | Voice | Size | Download |
| --- | --- | --- | --- |
| vits-melo-tts-zh_en | Female | 163MB | Download |
| vits-piper-zh_CN-huayan | Female | 300MB | Download |

English Voices

| Model | Accent | Voice | Size | Download |
| --- | --- | --- | --- | --- |
| vits-piper-en_US-lessac-high | American | Male | 500MB | Download |
| vits-piper-en_US-glados | American | Female | 300MB | Download |
| vits-piper-en_GB-semaine | British | Female | 300MB | Download |
| vits-piper-en_GB-lessac | British | Male | 500MB | Download |

Japanese

| Model | Voice | Size | Download |
| --- | --- | --- | --- |
| vits-vctk | Multi-speaker | 500MB+ | Download |

Spanish

| Model | Voice | Size | Download |
| --- | --- | --- | --- |
| vits-piper-es_ES-vox | Female | 300MB | Download |

French

| Model | Voice | Size | Download |
| --- | --- | --- | --- |
| vits-piper-fr_FR-siwis | Female | 300MB | Download |

German

| Model | Voice | Size | Download |
| --- | --- | --- | --- |
| vits-piper-de_DE-thorsten-medium | Male | 300MB | Download |

Kokoro Multi-Language (103+ Speakers)

| Model | Languages | Speakers | Size |
| --- | --- | --- | --- |
| kokoro-multi-lang-v1_0 | Multi | 103+ | Large |

Recommended: vits-melo-tts-zh_en for mixed Chinese-English.

🎧 Try Before You Download

Listen to samples at the Sherpa-ONNX Text-to-Speech Space on HuggingFace.

📚 More Languages and Models

Sherpa-ONNX supports 40+ languages and 100+ pre-trained models:

Supported languages include: Arabic, Bulgarian, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese, and more.

📦 Installation

  1. Clone this repository
git clone https://github.com/Johnny-xuan/smart-voice-chat.git ~/smart-voice-chat
  2. Download Sherpa-ONNX runtime
# macOS
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/v1.12.23/sherpa-onnx-v1.12.23-osx-universal2-shared.tar.bz2 | tar xjf -
mkdir -p ~/.clawdbot/sherpa-asr/runtime
mv sherpa-onnx*/* ~/.clawdbot/sherpa-asr/runtime/
  3. Download models
# STT Model (Chinese)
mkdir -p ~/.clawdbot/sherpa-asr/models
cd ~/.clawdbot/sherpa-asr/models
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/asr-models/sherpa-onnx-paraformer-zh-2024-03-09.tar.bz2 | tar xjf -

# TTS Model (Chinese-English)
mkdir -p ~/.clawdbot/tools/sherpa-onnx-tts/models
cd ~/.clawdbot/tools/sherpa-onnx-tts/models
curl -L https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-melo-tts-zh_en.tar.bz2 | tar xjf -
  4. Install to Clawdbot skills
# Find Clawdbot skills directory
# (ls -d expands the glob; a plain variable assignment would keep the literal "*")
CLAWDBOT_SKILLS=$(ls -d "$(npm root -g)"/clawdbot@*/node_modules/clawdbot/skills | head -n 1)

# Copy skill
cp -r ~/smart-voice-chat "$CLAWDBOT_SKILLS/"

# Verify
clawdbot skills list | grep smart-voice

Expected output: │ ✓ ready │ 🗣️ smart-voice- │ ...

  5. Restart Clawdbot
pkill -9 clawdbot
clawdbot-gateway &

⚙️ Configuration

Environment Variables (Optional)

Add to ~/.clawdbot/clawdbot.json:

{
  "skills": {
    "entries": {
      "smart-voice-chat": {
        "env": {
          "SMART_VOICE_CHAT_STT_MODEL": "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09",
          "SMART_VOICE_CHAT_TTS_MODEL": "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en",
          "SMART_VOICE_CHAT_OUTPUT_DIR": "/tmp/smart-voice-chat"
        }
      }
    }
  }
}
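
JSON values are not shell-expanded, so a leading `~` reaches the scripts as a literal character. A script reading these variables would therefore need to expand it itself. The sketch below shows one way to do that with defaults matching the values above; the function name is hypothetical and this is not necessarily how stt.py or tts.py read their settings.

```python
import os
from pathlib import Path
from typing import Dict, Mapping

# Defaults mirror the values shown in the JSON example.
DEFAULTS = {
    "SMART_VOICE_CHAT_STT_MODEL":
        "~/.clawdbot/sherpa-asr/models/sherpa-onnx-paraformer-zh-2024-03-09",
    "SMART_VOICE_CHAT_TTS_MODEL":
        "~/.clawdbot/tools/sherpa-onnx-tts/models/vits-melo-tts-zh_en",
    "SMART_VOICE_CHAT_OUTPUT_DIR": "/tmp/smart-voice-chat",
}


def resolve_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Path]:
    """Read each setting from env, fall back to its default, and expand '~'."""
    return {name: Path(env.get(name) or default).expanduser()
            for name, default in DEFAULTS.items()}
```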

SKILL.md Format

The metadata value in the SKILL.md frontmatter must be valid JSON:

---
name: smart-voice-chat
description: "Voice conversation with auto-detection (voice-to-voice, text-to-text)"
metadata: {"clawdbot":{"emoji":"🗣️","os":["darwin","linux"],"requires":{"anyBins":["ffmpeg"]}}}
---

Important:
- Always wrap description in quotes
- Avoid special characters in the description (use "to" instead of "→")
- requires only supports: bins, anyBins, env, config
- Does NOT support: python field (ignored by Clawdbot)
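
These rules can be checked before installing the skill. The sketch below validates only the metadata value, using the list of supported `requires` fields given above; the function name is hypothetical and this is not Clawdbot's own validator.

```python
import json
from typing import List

SUPPORTED_REQUIRES = {"bins", "anyBins", "env", "config"}


def check_metadata(metadata_json: str) -> List[str]:
    """Return a list of problems found in the metadata value (empty if none)."""
    try:
        meta = json.loads(metadata_json)
    except json.JSONDecodeError as exc:
        return [f"metadata is not valid JSON: {exc.msg}"]
    if not isinstance(meta, dict) or not isinstance(meta.get("clawdbot", {}), dict):
        return ["metadata must be a JSON object with an object under 'clawdbot'"]
    requires = meta.get("clawdbot", {}).get("requires", {})
    if not isinstance(requires, dict):
        return ["'requires' must be a JSON object"]
    return [f"unsupported requires field: {key}"
            for key in requires if key not in SUPPORTED_REQUIRES]
```

Run against a metadata value that lists a `python` requirement, it reports that field as unsupported.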

💡 Usage

Default Mirror Mode

You: [Voice message] "What's the weather like today?"
AI:  [Voice + Text] "It's sunny today, 25°C"
You: "What's the weather like today?"
AI:  "It's sunny today, 25°C" [Text only]

Override with Keywords

You: "Reply with voice: Will it rain tomorrow?"
AI:  [Voice only] "It might rain lightly tomorrow"
You: "Reply with text: What time is it now?"
AI:  [Text only] "It's 4 PM now"

🔧 Tech Stack

| Component | Technology |
| --- | --- |
| STT | Sherpa-ONNX Paraformer (zh-en) |
| TTS | Sherpa-ONNX VITS-Melo (zh-en) |
| Audio | FFmpeg (WAV → OGG/OPUS) |
| Language | Python 3 + Bash |

🐛 Troubleshooting

Skill not showing in clawdbot skills list

  1. Check SKILL.md syntax:
head -5 ~/.clawdbot/skills/smart-voice-chat/SKILL.md
  2. Verify FFmpeg is installed:
which ffmpeg
  3. Check logs:
tail -50 ~/.clawdbot/logs/gateway.err.log

OGG conversion fails

Install FFmpeg:

brew install ffmpeg  # macOS
sudo apt install ffmpeg  # Ubuntu/Debian

📄 License

MIT License - see LICENSE

🙏 Acknowledgments


Author: Johnny
GitHub: smart-voice-chat
