ThePlasmak / faster-whisper

# Install this skill:
npx skills add ThePlasmak/faster-whisper

Or install specific skill: npx add-skill https://github.com/ThePlasmak/faster-whisper

# Description

Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.

# SKILL.md


name: faster-whisper
description: Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
version: 1.0.4
author: ThePlasmak
homepage: https://github.com/ThePlasmak/faster-whisper
tags: ["audio", "transcription", "whisper", "speech-to-text", "ml", "cuda", "gpu"]
platforms: ["windows", "linux", "macos", "wsl2"]
metadata: {"moltbot":{"emoji":"🗣️","requires":{"bins":["ffmpeg","python3"]}}}


Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
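
For orientation, the bundled scripts are thin wrappers around the faster-whisper Python library. Below is a minimal sketch of a direct library call — the file name and CUDA settings are illustrative assumptions, not the skill's exact internals:

```python
# Minimal sketch of the faster-whisper library that the bundled scripts wrap.
# "audio.mp3" is an illustrative file name; adjust device/compute_type to your hardware.
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```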

When to Use

Use this skill when you need to:
- Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
- Convert speech to text locally — no API costs, works offline (after model download)
- Batch process multiple audio files — efficient for large collections
- Generate subtitles/captions — word-level timestamps available
- Multilingual transcription — supports 99+ languages with auto-detection

Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video"

When NOT to use:
- Real-time/streaming transcription (use streaming-optimized tools instead)
- Cloud-only environments without local compute
- Very short files (<10 seconds), where a cloud API's call latency doesn't matter anyway

Quick Reference

| Task | Command | Notes |
|------|---------|-------|
| Basic transcription | `./scripts/transcribe audio.mp3` | Uses default distil-large-v3 |
| Faster English | `./scripts/transcribe audio.mp3 --model distil-medium.en --language en` | English-only, ~6.8x faster |
| Maximum accuracy | `./scripts/transcribe audio.mp3 --model large-v3-turbo --beam-size 10` | Slower but best quality |
| Word timestamps | `./scripts/transcribe audio.mp3 --word-timestamps` | For subtitles/captions |
| JSON output | `./scripts/transcribe audio.mp3 --json -o output.json` | Programmatic access |
| Multilingual | `./scripts/transcribe audio.mp3 --model large-v3-turbo` | Auto-detects language |
| Remove silence | `./scripts/transcribe audio.mp3 --vad` | Voice activity detection |

Model Selection

Choose the right model for your needs:

digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="Start", shape=doublecircle];
    need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
    multilingual [label="Multilingual\ncontent?", shape=diamond];
    resource_constrained [label="Resource\nconstraints?", shape=diamond];

    large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3\n(default)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="yes"];
    need_accuracy -> multilingual [label="no"];
    multilingual -> large_turbo [label="yes"];
    multilingual -> resource_constrained [label="no (English)"];
    resource_constrained -> distil_small [label="mobile/edge"];
    resource_constrained -> distil_medium [label="some limits"];
    resource_constrained -> distil_large [label="no"];
}
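
If you prefer the same decision logic in code, here is a small hypothetical helper that mirrors the diagram above — the function and its parameters are illustrative only, not part of the skill:

```python
def pick_model(need_max_accuracy: bool, multilingual: bool, constraint: str | None = None) -> str:
    """Mirror of the model-selection diagram above (illustrative helper, not part of the skill)."""
    if need_max_accuracy:
        return "large-v3"            # or "large-v3-turbo"
    if multilingual:
        return "large-v3-turbo"
    if constraint == "mobile/edge":
        return "distil-small.en"
    if constraint == "some limits":
        return "distil-medium.en"
    return "distil-large-v3"         # default: best balance for English content

print(pick_model(need_max_accuracy=False, multilingual=False))  # -> distil-large-v3
```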

Model Table

Standard Models (Full Whisper)

| Model | Size | Speed | Accuracy | Use Case |
|-------|------|-------|----------|----------|
| tiny / tiny.en | 39M | Fastest | Basic | Quick drafts |
| base / base.en | 74M | Very fast | Good | General use |
| small / small.en | 244M | Fast | Better | Most tasks |
| medium / medium.en | 769M | Moderate | High | Quality transcription |
| large-v1/v2/v3 | 1.5GB | Slower | Best | Maximum accuracy |
| large-v3-turbo | 809M | Fast | Excellent | Recommended for accuracy |

Distilled Models (~6x Faster, ~1% WER difference)

| Model | Size | Speed vs Standard | Accuracy | Use Case |
|-------|------|-------------------|----------|----------|
| distil-large-v3 | 756M | ~6.3x faster | 9.7% WER | Default, best balance |
| distil-large-v2 | 756M | ~5.8x faster | 10.1% WER | Fallback |
| distil-medium.en | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
| distil-small.en | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |

.en models are English-only and slightly faster/better for English content.

Setup

Linux / macOS / WSL2

# Run the setup script (creates venv, installs deps, auto-detects GPU)
./setup.sh

Windows (Native)

# Run from PowerShell (auto-installs Python & ffmpeg if missing via winget)
.\setup.ps1

The Windows setup script will:
- Auto-install Python 3.12 via winget if not found
- Auto-install ffmpeg via winget if not found
- Detect NVIDIA GPU and install CUDA-enabled PyTorch
- Create venv and install all dependencies

Requirements:
- Linux/macOS/WSL2: Python 3.10+, ffmpeg
- Windows: Nothing! Setup auto-installs prerequisites via winget

Platform Support

| Platform | Acceleration | Speed | Auto-Install |
|----------|--------------|-------|--------------|
| Windows + NVIDIA GPU | CUDA | ~20x realtime 🚀 | ✅ Full |
| Linux + NVIDIA GPU | CUDA | ~20x realtime 🚀 | Manual prereqs |
| WSL2 + NVIDIA GPU | CUDA | ~20x realtime 🚀 | Manual prereqs |
| macOS Apple Silicon | CPU* | ~3-5x realtime | Manual prereqs |
| macOS Intel | CPU | ~1-2x realtime | Manual prereqs |
| Windows (no GPU) | CPU | ~1x realtime | ✅ Full |
| Linux (no GPU) | CPU | ~1x realtime | Manual prereqs |

*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

| Hardware | Speed | 9-min video |
|----------|-------|-------------|
| RTX 3070 (GPU) | ~20x realtime | ~27 sec |
| CPU (int8) | ~0.3x realtime | ~30 min |
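
If you drive the library directly rather than through the scripts, a common pattern is to pick the device from `torch.cuda.is_available()`. A sketch, assuming PyTorch is installed (the setup script installs it when a GPU is detected):

```python
# Sketch: choose GPU float16 when CUDA is available, otherwise CPU int8.
# Assumes PyTorch is installed (the setup script installs it when a GPU is detected).
import torch
from faster_whisper import WhisperModel

if torch.cuda.is_available():
    device, compute_type = "cuda", "float16"
else:
    device, compute_type = "cpu", "int8"   # int8 keeps CPU transcription usable

model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)
```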

If setup didn't detect your GPU, manually install PyTorch with CUDA:

Linux/macOS/WSL2:

# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

Windows:

# For CUDA 12.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu118

Usage

Linux/macOS/WSL2:

# Basic transcription
./scripts/transcribe audio.mp3

# With specific model
./scripts/transcribe audio.wav --model large-v3-turbo

# With word timestamps
./scripts/transcribe audio.mp3 --word-timestamps

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

# JSON output
./scripts/transcribe audio.mp3 --json

Windows (cmd or PowerShell):

# Basic transcription
.\scripts\transcribe.cmd audio.mp3

# With specific model
.\scripts\transcribe.cmd audio.wav --model large-v3-turbo

# With word timestamps (PowerShell native syntax also works)
.\scripts\transcribe.ps1 audio.mp3 -WordTimestamps

# JSON output
.\scripts\transcribe.cmd audio.mp3 --json

Options

--model, -m        Model name (default: distil-large-v3)
--language, -l     Language code (e.g., en, es, fr - auto-detect if omitted)
--word-timestamps  Include word-level timestamps
--beam-size        Beam search size (default: 5, higher = more accurate but slower)
--vad              Enable voice activity detection (removes silence)
--json, -j         Output as JSON
--output, -o       Save transcript to file
--device           cpu or cuda (auto-detected)
--compute-type     int8, float16, float32 (default: auto-optimized)
--quiet, -q        Suppress progress messages
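
These flags map closely onto parameters of the underlying faster-whisper API. A hedged sketch of roughly equivalent direct calls (the exact mapping inside the script is an assumption):

```python
from faster_whisper import WhisperModel

# Roughly: --model, --device, --compute-type
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

# Roughly: --language, --beam-size, --word-timestamps, --vad
segments, info = model.transcribe(
    "audio.mp3",
    language="en",          # skip auto-detection
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,        # drop long silences before decoding
)
```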

Examples

# Transcribe YouTube audio (after extraction with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3

# Batch transcription with JSON output
for file in *.mp3; do
  ./scripts/transcribe "$file" --json > "${file%.mp3}.json"
done
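
Note that each iteration of the loop above starts the script (and loads the model) from scratch. For large collections, loading the model once in Python and reusing it is usually much faster — a sketch using the library directly, with illustrative paths:

```python
# Sketch: load the model once and reuse it across many files (paths are illustrative).
import json
from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

for audio in sorted(Path(".").glob("*.mp3")):
    segments, info = model.transcribe(str(audio))
    result = [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
    audio.with_suffix(".json").write_text(json.dumps(result, ensure_ascii=False, indent=2))
    print(f"{audio.name}: {info.duration:.0f}s of audio transcribed")
```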

# High-accuracy transcription with larger beam size
./scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10 --word-timestamps

# Fast English-only transcription
./scripts/transcribe audio.mp3 \
  --model distil-medium.en --language en

# Transcribe with VAD (removes silence)
./scripts/transcribe audio.mp3 --vad
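
Word-level timestamps are mainly useful for captions. Below is a hedged sketch that writes a basic SRT file from segment timestamps via the library; the skill's own scripts may format subtitles differently:

```python
# Sketch: write a basic .srt file from segment timestamps (illustrative, not the skill's own output format).
from faster_whisper import WhisperModel

def srt_time(t: float) -> str:
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"

model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
# word_timestamps=True also populates seg.words for finer-grained captioning
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n\n")
```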

Common Mistakes

| Mistake | Problem | Solution |
|---------|---------|----------|
| Using CPU when GPU available | 10-20x slower transcription | Check nvidia-smi on Windows/Linux; verify CUDA installation |
| Not specifying language | Wastes time auto-detecting on known content | Use --language en when you know the language |
| Using wrong model | Unnecessary slowness or poor accuracy | Default distil-large-v3 is excellent; only use large-v3 if accuracy issues |
| Ignoring distilled models | Missing 6x speedup with <1% accuracy loss | Try distil-large-v3 before reaching for standard models |
| Forgetting ffmpeg | Setup fails or audio can't be processed | Setup script handles this; manual installs need ffmpeg separately |
| Out of memory errors | Model too large for available VRAM/RAM | Use smaller model or --compute-type int8 |
| Over-engineering beam size | Diminishing returns past beam-size 5-7 | Default 5 is fine; try 10 for critical transcripts |

Performance Notes

  • First run: Downloads model to ~/.cache/huggingface/ (one-time)
  • GPU: Automatically uses CUDA if available (~10-20x faster)
  • Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss
  • Memory:
    • distil-large-v3: ~2GB RAM / ~1GB VRAM
    • large-v3-turbo: ~4GB RAM / ~2GB VRAM
    • tiny/base: <1GB RAM

Why faster-whisper?

  • Speed: ~4-6x faster than OpenAI's original Whisper
  • Accuracy: Identical (uses same model weights)
  • Efficiency: Lower memory usage via quantization
  • Production-ready: Stable C++ backend (CTranslate2)
  • Distilled models: ~6x faster with <1% accuracy loss

Troubleshooting

"CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above)
Setup fails: Make sure Python 3.10+ is installed
Out of memory: Use smaller model or --compute-type int8
Slow on CPU: Expected — use GPU for practical transcription
Model download fails: Check ~/.cache/huggingface/ permissions (Linux/macOS) or %USERPROFILE%\.cache\huggingface\ (Windows)

Windows-Specific

"winget not found": Install App Installer from Microsoft Store, or install Python/ffmpeg manually
"Python not in PATH after install": Close and reopen your terminal, then run setup.ps1 again
PowerShell execution policy error: Run Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned or use transcribe.cmd instead
nvidia-smi not found but have GPU: Install NVIDIA drivers — the Game Ready or Studio drivers include nvidia-smi


# README.md

faster-whisper

A skill for your Moltbot agent that uses faster-whisper to transcribe audio more quickly.

faster-whisper is superior to OpenAI's original Whisper — it's a CTranslate2 reimplementation that's ~4-6x faster with identical accuracy.

Features

  • ~4-6x faster than OpenAI's original Whisper (same model weights, CTranslate2 backend)
  • ~20x realtime with GPU — transcribe 10 min of audio in ~30 sec
  • Distilled models available (~6x faster again with <1% WER loss)
  • Word-level timestamps
  • Voice activity detection (VAD) — removes silence automatically
  • Multilingual support (99+ languages)
  • Quantization for CPU efficiency
  • GPU acceleration (NVIDIA CUDA)

Installation

Option 1: Install from MoltHub

Via CLI (no installation required):

# Using npx (npm)
npx molthub@latest install faster-whisper

# Using pnpm
pnpm dlx molthub@latest install faster-whisper

# Using bun
bunx molthub@latest install faster-whisper

This downloads and installs the skill into your default skills directory (~/clawd/your-agent/workspace/skills/ or similar).

Via Web UI:
Go to https://clawdhub.com/ThePlasmak/faster-whisper and download the zip.

Option 2: Download from GitHub Releases

  1. Go to Releases
  2. Download the latest faster-whisper-X.X.X.zip
  3. Extract it to your agent's skills folder:
     - Default location: ~/clawd/your-agent/workspace/skills/faster-whisper
     - Or wherever your agent's workspace is configured
# Example
cd ~/clawd/your-agent/workspace/skills/
unzip ~/Downloads/faster-whisper-1.0.1.zip -d faster-whisper

If you're lazy: You can also ask your agent to install it by pasting this repo's link (https://github.com/ThePlasmak/faster-whisper) directly in chat.

Note: The release zip excludes repository files (CHANGELOG, LICENSE, README) and only contains the skill itself — this keeps things lightweight.

Setup

Using with your agent

If you're using your agent, it can guide you through the installation automatically.

What your agent does:

  • Detects your platform (Windows/Linux/macOS/WSL2)
  • Checks for Python, ffmpeg, and GPU drivers
  • Runs the appropriate setup script for you

Standalone CLI Setup

If you want to use the transcription scripts directly without your agent:

Windows (Native):

.\setup.ps1   # Auto-installs Python & ffmpeg via winget if needed

Linux / macOS / WSL2:

./setup.sh

What it installs:
- Python 3.10+ (if missing)
- ffmpeg (audio processing)
- faster-whisper + dependencies
- CUDA support (if NVIDIA GPU detected)

How to Use

With your agent

Just ask in natural language:

"Transcribe this audio file" (with file attached)
"Transcribe interview.mp3 with word timestamps"
"Transcribe this in Spanish"
"Transcribe this and save as JSON"

Your agent will:
- Use the GPU if available
- Handle errors and suggest fixes
- Return formatted results

Standalone CLI

Run the transcription script directly:

Linux / macOS / WSL2:

./scripts/transcribe audio.mp3
./scripts/transcribe audio.mp3 --word-timestamps --json

Windows:

.\scripts\transcribe.cmd audio.mp3
.\scripts\transcribe.cmd audio.mp3 --word-timestamps --json

CLI Options

| Option | Short | Description |
|--------|-------|-------------|
| --model | -m | Model name (default: distil-large-v3) |
| --language | -l | Language code (e.g., en, es, fr) — auto-detects if omitted |
| --word-timestamps | | Include word-level timestamps |
| --beam-size | | Beam search size (default: 5, higher = more accurate but slower) |
| --vad | | Enable voice activity detection (removes silence) |
| --json | -j | Output as JSON |
| --output | -o | Save transcript to file |
| --device | | cpu or cuda (auto-detected) |
| --compute-type | | int8, float16, float32 (auto-optimized) |
| --quiet | -q | Suppress progress messages |

CLI Examples

# Basic transcription
./scripts/transcribe interview.mp3

# Specify language (faster than auto-detect)
./scripts/transcribe podcast.mp3 --language en

# High-accuracy with word timestamps
./scripts/transcribe lecture.wav --model large-v3-turbo --word-timestamps

# JSON output saved to file
./scripts/transcribe meeting.m4a --json --output meeting.json

# Fast English-only transcription
./scripts/transcribe audio.mp3 --model distil-medium.en --language en

# Remove silence with VAD
./scripts/transcribe audio.mp3 --vad

Cross-Platform Support

| Platform | GPU Accel | Auto-Install |
|----------|-----------|--------------|
| Windows + NVIDIA | ✅ CUDA | ✅ via winget |
| Linux + NVIDIA | ✅ CUDA | ❌ manual |
| WSL2 + NVIDIA | ✅ CUDA | ❌ manual |
| macOS (Apple Silicon) | ❌ CPU only | ❌ manual |
| Windows/Linux (no GPU) | ❌ CPU only | Windows: ✅ / Linux: ❌ |

Notes:
- GPU acceleration is ~20-60x faster than CPU
- Apple Silicon Macs run on CPU but are still reasonably fast for practical use (~3-5x realtime vs ~20x with CUDA)
- On platforms without auto-install, your agent can guide you through manual setup

CUDA Requirements (NVIDIA GPUs)

  • Windows: CUDA drivers auto-install with GPU drivers
  • Linux/WSL2: Install CUDA Toolkit separately:
    ```bash
    # Ubuntu/Debian
    sudo apt install nvidia-cuda-toolkit

    # Check CUDA is available
    nvidia-smi
    ```

Default Model

distil-large-v3 (756MB download)

  • ~6x faster than large-v3
  • Within ~1% accuracy of full model
  • Best balance of speed and accuracy

See SKILL.md for full model list and recommendations.

Troubleshooting

CUDA Not Detected

Symptom: Script says "CUDA not available — using CPU (this will be slow!)"

Solutions:

  1. Check NVIDIA drivers are installed:
    ```bash
    # Windows/Linux/WSL2
    nvidia-smi
    ```
    If this fails, install/update your NVIDIA drivers

  2. WSL2 users: Install CUDA drivers on Windows, not inside WSL2

  3. Follow the NVIDIA CUDA on WSL2 Guide

  4. Reinstall PyTorch with CUDA:
    ```bash
    # Linux/macOS/WSL2
    .venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121

    # Windows
    .venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
    ```

  5. Verify CUDA is working:
    ```bash
    # Linux/macOS/WSL2
    .venv/bin/python -c "import torch; print(torch.cuda.is_available())"

    # Windows
    .venv\Scripts\python -c "import torch; print(torch.cuda.is_available())"
    ```

Out of Memory Errors

Symptom: RuntimeError: CUDA out of memory or OutOfMemoryError

Solutions:

  1. Use a smaller model:
    ```bash
    # Try distil-medium instead of distil-large-v3
    ./scripts/transcribe audio.mp3 --model distil-medium.en
    ```

  2. Use int8 quantization (reduces VRAM by ~4x):
    ```bash
    ./scripts/transcribe audio.mp3 --compute-type int8
    ```

  3. Fall back to CPU for large files:
    ```bash
    ./scripts/transcribe audio.mp3 --device cpu
    ```

  4. Split long audio files into smaller chunks (5-10 min segments)
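
For item 4, one way to split the audio is ffmpeg's segment muxer, then transcribe each piece. A sketch with illustrative file names and a 10-minute chunk length:

```python
# Sketch: split a long recording into ~10-minute chunks with ffmpeg, then transcribe each chunk.
# File names and chunk length are illustrative.
import subprocess
from pathlib import Path
from faster_whisper import WhisperModel

subprocess.run(
    ["ffmpeg", "-i", "long_meeting.mp3", "-f", "segment", "-segment_time", "600",
     "-c", "copy", "chunk_%03d.mp3"],
    check=True,
)

model = WhisperModel("distil-medium.en", device="cuda", compute_type="int8")  # smaller model + int8 to save VRAM
for chunk in sorted(Path(".").glob("chunk_*.mp3")):
    segments, _ = model.transcribe(str(chunk))
    print(f"--- {chunk.name} ---")
    for seg in segments:
        print(seg.text.strip())
```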

VRAM Requirements:
| Model | float16 | int8 |
|-------|---------|------|
| distil-large-v3 | ~2GB | ~1GB |
| large-v3 | ~5GB | ~2GB |
| medium | ~3GB | ~1.5GB |
| small | ~2GB | ~1GB |

ffmpeg Not Found

Symptom: FileNotFoundError: ffmpeg not found

Solutions:

  1. Windows: Re-run setup script (auto-installs via winget):
    ```powershell
    .\setup.ps1
    ```

  2. Linux:
    ```bash
    # Ubuntu/Debian
    sudo apt install ffmpeg

    # Fedora/RHEL
    sudo dnf install ffmpeg

    # Arch
    sudo pacman -S ffmpeg
    ```

  3. macOS:
    ```bash
    brew install ffmpeg
    ```

  4. Verify installation:
    ```bash
    ffmpeg -version
    ```

Model Download Fails

Symptom: HTTPError, ConnectionError, or timeout during first run

Solutions:

  1. Check internet connection

  2. Retry with increased timeout:
    - Models download automatically on first use
    - Download sizes: 75MB (tiny) to 3GB (large-v3)

  3. Use a VPN if Hugging Face is blocked in your region

  4. Manual download:
    ```python
    from faster_whisper import WhisperModel
    model = WhisperModel("distil-large-v3", device="cpu")
    ```

Very Slow Transcription

Symptom: Transcription takes longer than the audio duration

Expected speeds:
- GPU (CUDA): ~20-30x realtime (30 min audio → ~1-2 min)
- Apple Silicon (CPU): ~2-5x realtime (30 min audio → ~6-15 min)
- Intel CPU: ~0.5-1x realtime (30 min audio → 30-60 min)

Solutions:

  1. Ensure GPU is being used:
    ```bash
    # Look for "Loading model: ... (cuda, float16) on NVIDIA ..."
    ./scripts/transcribe audio.mp3
    ```

  2. Use a smaller/distilled model:
    ```bash
    ./scripts/transcribe audio.mp3 --model distil-small.en
    ```

  3. Specify language (skips auto-detection):
    ```bash
    ./scripts/transcribe audio.mp3 --language en
    ```

  4. Reduce beam size:
    ```bash
    ./scripts/transcribe audio.mp3 --beam-size 1
    ```
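
To see which bucket you're in, you can time a run against the audio duration the library reports. A small sketch (file name illustrative; the segments generator must be consumed for the work to actually happen):

```python
# Sketch: measure the realtime factor of a transcription run (file name is illustrative).
import time
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3")  # device/compute_type auto-selected by the library
start = time.perf_counter()
segments, info = model.transcribe("audio.mp3", language="en", beam_size=1)
text = " ".join(seg.text for seg in segments)   # consuming the generator does the actual work
elapsed = time.perf_counter() - start

print(f"Audio: {info.duration:.0f}s, transcribed in {elapsed:.0f}s "
      f"({info.duration / elapsed:.1f}x realtime)")
```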

Audio Format Issues

Symptom: Error: Unsupported format or transcription produces garbage

Solutions:

  1. Supported formats: MP3, WAV, M4A, FLAC, OGG, WebM

  2. Most common formats work via ffmpeg

  3. Convert problematic formats:
    ```bash
    ffmpeg -i input.xyz -ar 16000 output.mp3
    ```

  4. Check audio isn't corrupted:
    ```bash
    ffmpeg -i audio.mp3 -f null -
    ```

Python Version Issues

Symptom: SyntaxError or ImportError during setup

Solutions:

  1. Requires Python 3.10 or newer:
    ```bash
    python --version   # or python3 --version
    ```

  2. Windows: Setup script auto-installs Python 3.12 via winget

  3. Linux/macOS: Install Python 3.10+:
    ```bash
    # Ubuntu
    sudo apt install python3.12 python3.12-venv

    # macOS
    brew install python@3.12
    ```

Still Having Issues?

  1. Check the logs: Run without --quiet to see detailed error messages
  2. Ask your agent: Paste the error — it can usually diagnose faster-whisper or installation issues
  3. Open an issue: GitHub Issues
  4. Include:
    - Platform (Windows/Linux/macOS/WSL2)
    - GPU model (if any)
    - Python version
    - Full error message


# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.