npx skills add ThePlasmak/faster-whisper
Or install the skill directly from GitHub: npx add-skill https://github.com/ThePlasmak/faster-whisper
# Description
Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
# SKILL.md
name: faster-whisper
description: Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
version: 1.0.4
author: ThePlasmak
homepage: https://github.com/ThePlasmak/faster-whisper
tags: ["audio", "transcription", "whisper", "speech-to-text", "ml", "cuda", "gpu"]
platforms: ["windows", "linux", "macos", "wsl2"]
metadata: {"moltbot":{"emoji":"🗣️","requires":{"bins":["ffmpeg","python3"]}}}
Faster Whisper
Local speech-to-text using faster-whisper, a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
When to Use
Use this skill when you need to:
- Transcribe audio/video files: meetings, interviews, podcasts, lectures, YouTube videos
- Convert speech to text locally: no API costs, works offline (after model download)
- Batch process multiple audio files: efficient for large collections
- Generate subtitles/captions: word-level timestamps available
- Multilingual transcription: supports 99+ languages with auto-detection
Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video"
When NOT to use:
- Real-time/streaming transcription (use streaming-optimized tools instead)
- Cloud-only environments without local compute
- Very short files (<10 seconds) where an API call's latency doesn't matter
Quick Reference
| Task | Command | Notes |
|---|---|---|
| Basic transcription | `./scripts/transcribe audio.mp3` | Uses default `distil-large-v3` |
| Faster English | `./scripts/transcribe audio.mp3 --model distil-medium.en --language en` | English-only, ~6.8x faster |
| Maximum accuracy | `./scripts/transcribe audio.mp3 --model large-v3-turbo --beam-size 10` | Slower but best quality |
| Word timestamps | `./scripts/transcribe audio.mp3 --word-timestamps` | For subtitles/captions |
| JSON output | `./scripts/transcribe audio.mp3 --json -o output.json` | Programmatic access |
| Multilingual | `./scripts/transcribe audio.mp3 --model large-v3-turbo` | Auto-detects language |
| Remove silence | `./scripts/transcribe audio.mp3 --vad` | Voice activity detection |
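If you need programmatic access beyond the wrapper script, the underlying faster-whisper Python API is small. The snippet below is a minimal sketch of the real library calls, assuming it runs inside the skill's virtualenv and that `audio.mp3` is a placeholder for your file; the bundled `./scripts/transcribe` wrapper is assumed to do something similar.

```python
# Minimal sketch: call the faster-whisper library directly.
from faster_whisper import WhisperModel

# "auto" picks CUDA when available, otherwise CPU; compute type is chosen automatically.
model = WhisperModel("distil-large-v3", device="auto", compute_type="default")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:6.2f}s -> {segment.end:6.2f}s] {segment.text.strip()}")
```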
Model Selection
Choose the right model for your needs:
```dot
digraph model_selection {
  rankdir=LR;
  node [shape=box, style=rounded];
  start [label="Start", shape=doublecircle];
  need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
  multilingual [label="Multilingual\ncontent?", shape=diamond];
  resource_constrained [label="Resource\nconstraints?", shape=diamond];
  large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
  large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
  distil_large [label="distil-large-v3\n(default)", style="rounded,filled", fillcolor=lightgreen];
  distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
  distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];
  start -> need_accuracy;
  need_accuracy -> large_v3 [label="yes"];
  need_accuracy -> multilingual [label="no"];
  multilingual -> large_turbo [label="yes"];
  multilingual -> resource_constrained [label="no (English)"];
  resource_constrained -> distil_small [label="mobile/edge"];
  resource_constrained -> distil_medium [label="some limits"];
  resource_constrained -> distil_large [label="no"];
}
```
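The same decision logic, written out as a tiny helper function. This is purely illustrative; the function and its argument names are not part of the skill's scripts.

```python
# Hypothetical helper mirroring the flowchart above (not part of the skill).
def pick_model(need_max_accuracy: bool, multilingual: bool, constraint: str | None = None) -> str:
    if need_max_accuracy:
        return "large-v3"          # or "large-v3-turbo"
    if multilingual:
        return "large-v3-turbo"
    if constraint == "mobile/edge":
        return "distil-small.en"
    if constraint == "some limits":
        return "distil-medium.en"
    return "distil-large-v3"       # default: best balance

print(pick_model(need_max_accuracy=False, multilingual=False))  # -> distil-large-v3
```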
Model Table
Standard Models (Full Whisper)
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| `tiny` / `tiny.en` | 39M | Fastest | Basic | Quick drafts |
| `base` / `base.en` | 74M | Very fast | Good | General use |
| `small` / `small.en` | 244M | Fast | Better | Most tasks |
| `medium` / `medium.en` | 769M | Moderate | High | Quality transcription |
| `large-v1`/`v2`/`v3` | 1.5GB | Slower | Best | Maximum accuracy |
| `large-v3-turbo` | 809M | Fast | Excellent | Recommended for accuracy |
Distilled Models (~6x Faster, ~1% WER difference)
| Model | Size | Speed vs Standard | Accuracy | Use Case |
|---|---|---|---|---|
| `distil-large-v3` | 756M | ~6.3x faster | 9.7% WER | Default, best balance |
| `distil-large-v2` | 756M | ~5.8x faster | 10.1% WER | Fallback |
| `distil-medium.en` | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
| `distil-small.en` | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |
`.en` models are English-only and slightly faster/better for English content.
Setup
Linux / macOS / WSL2
```bash
# Run the setup script (creates venv, installs deps, auto-detects GPU)
./setup.sh
```
Windows (Native)
```powershell
# Run from PowerShell (auto-installs Python & ffmpeg via winget if missing)
.\setup.ps1
```
The Windows setup script will:
- Auto-install Python 3.12 via winget if not found
- Auto-install ffmpeg via winget if not found
- Detect NVIDIA GPU and install CUDA-enabled PyTorch
- Create venv and install all dependencies
Requirements:
- Linux/macOS/WSL2: Python 3.10+, ffmpeg
- Windows: Nothing! Setup auto-installs prerequisites via winget
Platform Support
| Platform | Acceleration | Speed | Auto-Install |
|---|---|---|---|
| Windows + NVIDIA GPU | CUDA | ~20x realtime | Full |
| Linux + NVIDIA GPU | CUDA | ~20x realtime | Manual prereqs |
| WSL2 + NVIDIA GPU | CUDA | ~20x realtime | Manual prereqs |
| macOS Apple Silicon | CPU* | ~3-5x realtime | Manual prereqs |
| macOS Intel | CPU | ~1-2x realtime | Manual prereqs |
| Windows (no GPU) | CPU | ~1x realtime | Full |
| Linux (no GPU) | CPU | ~1x realtime | Manual prereqs |
*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.
GPU Support (IMPORTANT!)
The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use the GPU if available; CPU transcription is extremely slow.
| Hardware | Speed | 9-min video |
|---|---|---|
| RTX 3070 (GPU) | ~20x realtime | ~27 sec |
| CPU (int8) | ~0.3x realtime | ~30 min |
If setup didn't detect your GPU, manually install PyTorch with CUDA:
Linux/macOS/WSL2:
```bash
# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121
# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118
```
Windows:
```powershell
# For CUDA 12.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
# For CUDA 11.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu118
```
- Windows users: Ensure you have NVIDIA drivers installed
- WSL2 users: Ensure you have the NVIDIA CUDA drivers for WSL installed on Windows
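To confirm the GPU will actually be used, you can query both PyTorch and CTranslate2 (the inference backend faster-whisper runs on) from inside the skill's virtualenv. A quick check, assuming the packages the setup script installs:

```python
# Quick CUDA visibility check from inside the skill's venv.
import torch          # installed by setup for the CUDA runtime libraries
import ctranslate2    # the backend faster-whisper uses

print("PyTorch sees CUDA:       ", torch.cuda.is_available())
print("CTranslate2 CUDA devices:", ctranslate2.get_cuda_device_count())
```

If both report zero/False, transcription will silently fall back to the CPU.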
Usage
Linux/macOS/WSL2:
```bash
# Basic transcription
./scripts/transcribe audio.mp3

# With specific model
./scripts/transcribe audio.wav --model large-v3-turbo

# With word timestamps
./scripts/transcribe audio.mp3 --word-timestamps

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

# JSON output
./scripts/transcribe audio.mp3 --json
```
Windows (cmd or PowerShell):
```powershell
# Basic transcription
.\scripts\transcribe.cmd audio.mp3

# With specific model
.\scripts\transcribe.cmd audio.wav --model large-v3-turbo

# With word timestamps (PowerShell native syntax also works)
.\scripts\transcribe.ps1 audio.mp3 -WordTimestamps

# JSON output
.\scripts\transcribe.cmd audio.mp3 --json
```
Options
```text
--model, -m          Model name (default: distil-large-v3)
--language, -l       Language code (e.g., en, es, fr - auto-detect if omitted)
--word-timestamps    Include word-level timestamps
--beam-size          Beam search size (default: 5, higher = more accurate but slower)
--vad                Enable voice activity detection (removes silence)
--json, -j           Output as JSON
--output, -o         Save transcript to file
--device             cpu or cuda (auto-detected)
--compute-type       int8, float16, float32 (default: auto-optimized)
--quiet, -q          Suppress progress messages
```
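These flags line up with parameters of faster-whisper's `WhisperModel` and `transcribe()`. The sketch below shows the likely mapping; how the wrapper script forwards them is an assumption, but the library parameters themselves are real:

```python
# Likely mapping from the CLI flags above to faster-whisper parameters (sketch).
from faster_whisper import WhisperModel

model = WhisperModel(
    "distil-large-v3",        # --model
    device="cuda",            # --device (or "cpu")
    compute_type="float16",   # --compute-type
)

segments, info = model.transcribe(
    "audio.mp3",
    language="en",            # --language (omit to auto-detect)
    beam_size=5,              # --beam-size
    word_timestamps=True,     # --word-timestamps
    vad_filter=True,          # --vad
)

for seg in segments:
    print(seg.text.strip())
```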
Examples
```bash
# Transcribe YouTube audio (after extraction with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3

# Batch transcription with JSON output
for file in *.mp3; do
  ./scripts/transcribe "$file" --json > "${file%.mp3}.json"
done

# High-accuracy transcription with larger beam size
./scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10 --word-timestamps

# Fast English-only transcription
./scripts/transcribe audio.mp3 \
  --model distil-medium.en --language en

# Transcribe with VAD (removes silence)
./scripts/transcribe audio.mp3 --vad
```
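For the subtitle/caption use case, segment timestamps can be written straight to an .srt file. A sketch using the library directly; the `fmt` helper and file names are illustrative, and passing `word_timestamps=True` would additionally populate `seg.words` if you want finer-grained captions:

```python
# Sketch: write segment-level timestamps to an SRT subtitle file.
from faster_whisper import WhisperModel

def fmt(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int((t - int(t)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

model = WhisperModel("distil-large-v3")
segments, _ = model.transcribe("audio.mp3")

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{fmt(seg.start)} --> {fmt(seg.end)}\n{seg.text.strip()}\n\n")

print("Wrote audio.srt")
```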
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Using CPU when GPU available | 10-20x slower transcription | Check nvidia-smi on Windows/Linux; verify CUDA installation |
| Not specifying language | Wastes time auto-detecting on known content | Use --language en when you know the language |
| Using wrong model | Unnecessary slowness or poor accuracy | Default distil-large-v3 is excellent; only use large-v3 if accuracy issues |
| Ignoring distilled models | Missing 6x speedup with <1% accuracy loss | Try distil-large-v3 before reaching for standard models |
| Forgetting ffmpeg | Setup fails or audio can't be processed | Setup script handles this; manual installs need ffmpeg separately |
| Out of memory errors | Model too large for available VRAM/RAM | Use smaller model or --compute-type int8 |
| Over-engineering beam size | Diminishing returns past beam-size 5-7 | Default 5 is fine; try 10 for critical transcripts |
Performance Notes
- First run: Downloads the model to `~/.cache/huggingface/` (one-time)
- GPU: Automatically uses CUDA if available (~10-20x faster)
- Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss
- Memory:
  - `distil-large-v3`: ~2GB RAM / ~1GB VRAM
  - `large-v3-turbo`: ~4GB RAM / ~2GB VRAM
  - `tiny`/`base`: <1GB RAM
Why faster-whisper?
- Speed: ~4-6x faster than OpenAI's original Whisper
- Accuracy: Identical (uses same model weights)
- Efficiency: Lower memory usage via quantization
- Production-ready: Stable C++ backend (CTranslate2)
- Distilled models: ~6x faster with <1% accuracy loss
Troubleshooting
"CUDA not available β using CPU": Install PyTorch with CUDA (see GPU Support above)
Setup fails: Make sure Python 3.10+ is installed
Out of memory: Use smaller model or --compute-type int8
Slow on CPU: Expected β use GPU for practical transcription
Model download fails: Check ~/.cache/huggingface/ permissions (Linux/macOS) or %USERPROFILE%\.cache\huggingface\ (Windows)
Windows-Specific
"winget not found": Install App Installer from Microsoft Store, or install Python/ffmpeg manually
"Python not in PATH after install": Close and reopen your terminal, then run setup.ps1 again
PowerShell execution policy error: Run Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned or use transcribe.cmd instead
nvidia-smi not found but have GPU: Install NVIDIA drivers β the Game Ready or Studio drivers include nvidia-smi
# README.md
faster-whisper
A skill for your Moltbot agent that uses faster-whisper to transcribe audio more quickly.
faster-whisper is superior to OpenAI's original Whisper: it's a CTranslate2 reimplementation that's ~4-6x faster with identical accuracy.
Features
- ~4-6x faster than OpenAI's original Whisper (same model weights, CTranslate2 backend)
- ~20x realtime with GPU: transcribe 10 min of audio in ~30 sec
- Distilled models available (~6x faster again with <1% WER loss)
- Word-level timestamps
- Voice activity detection (VAD): removes silence automatically
- Multilingual support (100+ languages)
- Quantization for CPU efficiency
- GPU acceleration (NVIDIA CUDA)
Installation
Option 1: Install from MoltHub
Via CLI (no installation required):
```bash
# Using npx (npm)
npx molthub@latest install faster-whisper

# Using pnpm
pnpm dlx molthub@latest install faster-whisper

# Using bun
bunx molthub@latest install faster-whisper
```
This downloads and installs the skill into your default skills directory (~/clawd/your-agent/workspace/skills/ or similar).
Via Web UI:
Go to https://clawdhub.com/ThePlasmak/faster-whisper and download the zip.
Option 2: Download from GitHub Releases
1. Go to Releases
2. Download the latest `faster-whisper-X.X.X.zip`
3. Extract it to your agent's skills folder:
   - Default location: `~/clawd/your-agent/workspace/skills/faster-whisper`
   - Or wherever your agent's workspace is configured
```bash
# Example
cd ~/clawd/your-agent/workspace/skills/
unzip ~/Downloads/faster-whisper-1.0.1.zip -d faster-whisper
```
If you're lazy: You can also ask your agent to install it by pasting this repo's link (https://github.com/ThePlasmak/faster-whisper) directly in chat.
Note: The release zip excludes repository files (CHANGELOG, LICENSE, README) and only contains the skill itself; this keeps things lightweight.
Setup
Using with your agent
If you're using your agent, it can guide you through the installation automatically.
What your agent does:
- Detects your platform (Windows/Linux/macOS/WSL2)
- Checks for Python, ffmpeg, and GPU drivers
- Runs the appropriate setup script for you
Standalone CLI Setup
If you want to use the transcription scripts directly without your agent:
Windows (Native):
```powershell
.\setup.ps1   # Auto-installs Python & ffmpeg via winget if needed
```
Linux / macOS / WSL2:
```bash
./setup.sh
```
What it installs:
- Python 3.10+ (if missing)
- ffmpeg (audio processing)
- faster-whisper + dependencies
- CUDA support (if NVIDIA GPU detected)
How to Use
With your agent
Just ask in natural language:
"Transcribe this audio file" (with file attached)
"Transcribe interview.mp3 with word timestamps"
"Transcribe this in Spanish"
"Transcribe this and save as JSON"
Your agent will:
- Use the GPU if available
- Handle errors and suggest fixes
- Return formatted results
Standalone CLI
Run the transcription script directly:
Linux / macOS / WSL2:
```bash
./scripts/transcribe audio.mp3
./scripts/transcribe audio.mp3 --word-timestamps --json
```
Windows:
```powershell
.\scripts\transcribe.cmd audio.mp3
.\scripts\transcribe.cmd audio.mp3 --word-timestamps --json
```
CLI Options
| Option | Short | Description |
|---|---|---|
| `--model` | `-m` | Model name (default: `distil-large-v3`) |
| `--language` | `-l` | Language code (e.g., `en`, `es`, `fr`); auto-detects if omitted |
| `--word-timestamps` | | Include word-level timestamps |
| `--beam-size` | | Beam search size (default: 5, higher = more accurate but slower) |
| `--vad` | | Enable voice activity detection (removes silence) |
| `--json` | `-j` | Output as JSON |
| `--output` | `-o` | Save transcript to file |
| `--device` | | `cpu` or `cuda` (auto-detected) |
| `--compute-type` | | `int8`, `float16`, `float32` (auto-optimized) |
| `--quiet` | `-q` | Suppress progress messages |
CLI Examples
```bash
# Basic transcription
./scripts/transcribe interview.mp3

# Specify language (faster than auto-detect)
./scripts/transcribe podcast.mp3 --language en

# High-accuracy with word timestamps
./scripts/transcribe lecture.wav --model large-v3-turbo --word-timestamps

# JSON output saved to file
./scripts/transcribe meeting.m4a --json --output meeting.json

# Fast English-only transcription
./scripts/transcribe audio.mp3 --model distil-medium.en --language en

# Remove silence with VAD
./scripts/transcribe audio.mp3 --vad
```
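For batch jobs it is usually faster to load the model once and loop in Python instead of re-invoking the script per file. A sketch of that pattern; the directory layout, file extension, and model choice are assumptions:

```python
# Sketch: batch-transcribe every .mp3 in the current directory to sidecar .txt files.
from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3")  # loaded once, reused for every file

for audio in sorted(Path(".").glob("*.mp3")):
    segments, _ = model.transcribe(str(audio))
    text = " ".join(seg.text.strip() for seg in segments)
    out = audio.with_suffix(".txt")
    out.write_text(text + "\n", encoding="utf-8")
    print(f"{audio.name} -> {out.name}")
```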
Cross-Platform Support
| Platform | GPU Accel | Auto-Install |
|---|---|---|
| Windows + NVIDIA | CUDA | Yes (via winget) |
| Linux + NVIDIA | CUDA | Manual |
| WSL2 + NVIDIA | CUDA | Manual |
| macOS (Apple Silicon) | CPU only | Manual |
| Windows/Linux (no GPU) | CPU only | Windows: yes / Linux: manual |
Notes:
- GPU acceleration is ~20-60x faster than CPU
- Apple Silicon Macs run on CPU but are still reasonably fast (~2-3x slower than CUDA)
- On platforms without auto-install, your agent can guide you through manual setup
CUDA Requirements (NVIDIA GPUs)
- Windows: CUDA drivers auto-install with GPU drivers
- Linux/WSL2: Install CUDA Toolkit separately:
```bash
# Ubuntu/Debian
sudo apt install nvidia-cuda-toolkit
# Check CUDA is available
nvidia-smi
```
Default Model
distil-large-v3 (756MB download)
- ~6x faster than large-v3
- Within ~1% accuracy of full model
- Best balance of speed and accuracy
See SKILL.md for full model list and recommendations.
Troubleshooting
CUDA Not Detected
Symptom: Script says "CUDA not available - using CPU (this will be slow!)"
Solutions:
- Check NVIDIA drivers are installed:

```bash
# Windows/Linux/WSL2
nvidia-smi
```

If this fails, install/update your NVIDIA drivers.

- WSL2 users: Install CUDA drivers on Windows, not inside WSL2. Follow: NVIDIA CUDA on WSL2 Guide

- Reinstall PyTorch with CUDA:
```bash
# Linux/macOS/WSL2
.venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121
# Windows
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
```
- Verify CUDA is working:
```bash
# Linux/macOS/WSL2
.venv/bin/python -c "import torch; print(torch.cuda.is_available())"
# Windows
.venv\Scripts\python -c "import torch; print(torch.cuda.is_available())"
```
Out of Memory Errors
Symptom: RuntimeError: CUDA out of memory or OutOfMemoryError
Solutions:
- Use a smaller model:

```bash
# Try distil-medium instead of distil-large-v3
./scripts/transcribe audio.mp3 --model distil-medium.en
```

- Use int8 quantization (reduces VRAM by ~4x):

```bash
./scripts/transcribe audio.mp3 --compute-type int8
```

- Fall back to CPU for large files:

```bash
./scripts/transcribe audio.mp3 --device cpu
```

- Split long audio files into smaller chunks (5-10 min segments); see the sketch after the VRAM table below.
VRAM Requirements:
| Model | float16 | int8 |
|-------|---------|------|
| distil-large-v3 | ~2GB | ~1GB |
| large-v3 | ~5GB | ~2GB |
| medium | ~3GB | ~1.5GB |
| small | ~2GB | ~1GB |
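One way to do that chunking is ffmpeg's segment muxer, driven from Python. A sketch; the file names and the 10-minute chunk length are assumptions:

```python
# Sketch: split a long recording into 10-minute chunks before transcribing,
# so each chunk fits comfortably in VRAM/RAM.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "long_audio.mp3",
        "-f", "segment",
        "-segment_time", "600",   # 600 seconds = 10 minutes per chunk
        "-c", "copy",             # no re-encoding, just cut into pieces
        "chunk_%03d.mp3",
    ],
    check=True,
)
```

The resulting chunk_000.mp3, chunk_001.mp3, ... can then be transcribed one at a time.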
ffmpeg Not Found
Symptom: FileNotFoundError: ffmpeg not found
Solutions:
- Windows: Re-run the setup script (auto-installs via winget):

```powershell
.\setup.ps1
```

- Linux:
```bash
# Ubuntu/Debian
sudo apt install ffmpeg
# Fedora/RHEL
sudo dnf install ffmpeg
# Arch
sudo pacman -S ffmpeg
```
- macOS:

```bash
brew install ffmpeg
```

- Verify installation:

```bash
ffmpeg -version
```
Model Download Fails
Symptom: HTTPError, ConnectionError, or timeout during first run
Solutions:
- Check internet connection
- Retry with increased timeout:
  - Models download automatically on first use
  - Download sizes: 75MB (tiny) to 3GB (large-v3)
- Use a VPN if Hugging Face is blocked in your region
- Manual download:

```python
from faster_whisper import WhisperModel
model = WhisperModel("distil-large-v3", device="cpu")
```
Very Slow Transcription
Symptom: Transcription takes longer than the audio duration
Expected speeds:
- GPU (CUDA): ~20-30x realtime (30 min audio in ~1-2 min)
- Apple Silicon (CPU): ~2-5x realtime (30 min audio in ~6-15 min)
- Intel CPU: ~0.5-1x realtime (30 min audio in 30-60 min)
Solutions:
- Ensure the GPU is being used:

```bash
# Look for "Loading model: ... (cuda, float16) on NVIDIA ..."
./scripts/transcribe audio.mp3
```

- Use a smaller/distilled model:

```bash
./scripts/transcribe audio.mp3 --model distil-small.en
```

- Specify the language (skips auto-detection):

```bash
./scripts/transcribe audio.mp3 --language en
```

- Reduce beam size:

```bash
./scripts/transcribe audio.mp3 --beam-size 1
```
Audio Format Issues
Symptom: Error: Unsupported format or transcription produces garbage
Solutions:
- Supported formats: MP3, WAV, M4A, FLAC, OGG, WebM
- Most common formats work via ffmpeg
- Convert problematic formats:

```bash
ffmpeg -i input.xyz -ar 16000 output.mp3
```

- Check the audio isn't corrupted:

```bash
ffmpeg -i audio.mp3 -f null -
```
Python Version Issues
Symptom: SyntaxError or ImportError during setup
Solutions:
- Requires Python 3.10 or newer:

```bash
python --version   # or python3 --version
```

- Windows: The setup script auto-installs Python 3.12 via winget
- Linux/macOS: Install Python 3.10+:
```bash
# Ubuntu
sudo apt install python3.12 python3.12-venv
# macOS
brew install [email protected]
```
Still Having Issues?
- Check the logs: Run without `--quiet` to see detailed error messages
- Ask your agent: Paste the error; it can usually diagnose faster-whisper or installation issues
- Open an issue: GitHub Issues
- Include:
  - Platform (Windows/Linux/macOS/WSL2)
  - GPU model (if any)
  - Python version
  - Full error message
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents.
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.