ThePlasmak / faster-whisper

# Install this skill:
npx skills add ThePlasmak/faster-whisper

Or install specific skill: npx add-skill https://github.com/ThePlasmak/faster-whisper

# Description

Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.

# SKILL.md


name: faster-whisper
description: Local speech-to-text using faster-whisper. 4-6x faster than OpenAI Whisper with identical accuracy; GPU acceleration enables ~20x realtime transcription. Supports standard and distilled models with word-level timestamps.
version: 1.0.4
author: ThePlasmak
homepage: https://github.com/ThePlasmak/faster-whisper
tags: ["audio", "transcription", "whisper", "speech-to-text", "ml", "cuda", "gpu"]
platforms: ["windows", "linux", "macos", "wsl2"]
metadata: {"moltbot":{"emoji":"🗣️","requires":{"bins":["ffmpeg","python3"]}}}


Faster Whisper

Local speech-to-text using faster-whisper — a CTranslate2 reimplementation of OpenAI's Whisper that runs 4-6x faster with identical accuracy. With GPU acceleration, expect ~20x realtime transcription (a 10-minute audio file in ~30 seconds).
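
For orientation, the bundled scripts are thin wrappers around the faster-whisper Python library. Below is a minimal sketch of a direct library call — the file name and CUDA settings are illustrative assumptions, not the skill's exact internals:

```python
# Minimal sketch of the faster-whisper library that the bundled scripts wrap.
# "audio.mp3" is an illustrative file name; adjust device/compute_type to your hardware.
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```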

When to Use

Use this skill when you need to:
- Transcribe audio/video files — meetings, interviews, podcasts, lectures, YouTube videos
- Convert speech to text locally — no API costs, works offline (after model download)
- Batch process multiple audio files — efficient for large collections
- Generate subtitles/captions — word-level timestamps available
- Multilingual transcription — supports 99+ languages with auto-detection

Trigger phrases: "transcribe this audio", "convert speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video"

When NOT to use:
- Real-time/streaming transcription (use streaming-optimized tools instead)
- Cloud-only environments without local compute
- Very short files (<10 seconds), where a cloud API's call latency doesn't matter anyway

Quick Reference

| Task | Command | Notes |
|------|---------|-------|
| Basic transcription | `./scripts/transcribe audio.mp3` | Uses default distil-large-v3 |
| Faster English | `./scripts/transcribe audio.mp3 --model distil-medium.en --language en` | English-only, ~6.8x faster |
| Maximum accuracy | `./scripts/transcribe audio.mp3 --model large-v3-turbo --beam-size 10` | Slower but best quality |
| Word timestamps | `./scripts/transcribe audio.mp3 --word-timestamps` | For subtitles/captions |
| JSON output | `./scripts/transcribe audio.mp3 --json -o output.json` | Programmatic access |
| Multilingual | `./scripts/transcribe audio.mp3 --model large-v3-turbo` | Auto-detects language |
| Remove silence | `./scripts/transcribe audio.mp3 --vad` | Voice activity detection |

Model Selection

Choose the right model for your needs:

digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="Start", shape=doublecircle];
    need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
    multilingual [label="Multilingual\ncontent?", shape=diamond];
    resource_constrained [label="Resource\nconstraints?", shape=diamond];

    large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3\n(default)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="yes"];
    need_accuracy -> multilingual [label="no"];
    multilingual -> large_turbo [label="yes"];
    multilingual -> resource_constrained [label="no (English)"];
    resource_constrained -> distil_small [label="mobile/edge"];
    resource_constrained -> distil_medium [label="some limits"];
    resource_constrained -> distil_large [label="no"];
}
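
If you prefer the same decision logic in code, here is a small hypothetical helper that mirrors the diagram above — the function and its parameters are illustrative only, not part of the skill:

```python
def pick_model(need_max_accuracy: bool, multilingual: bool, constraint: str | None = None) -> str:
    """Mirror of the model-selection diagram above (illustrative helper, not part of the skill)."""
    if need_max_accuracy:
        return "large-v3"            # or "large-v3-turbo"
    if multilingual:
        return "large-v3-turbo"
    if constraint == "mobile/edge":
        return "distil-small.en"
    if constraint == "some limits":
        return "distil-medium.en"
    return "distil-large-v3"         # default: best balance for English content

print(pick_model(need_max_accuracy=False, multilingual=False))  # -> distil-large-v3
```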

Model Table

Standard Models (Full Whisper)

| Model | Size | Speed | Accuracy | Use Case |
|-------|------|-------|----------|----------|
| tiny / tiny.en | 39M | Fastest | Basic | Quick drafts |
| base / base.en | 74M | Very fast | Good | General use |
| small / small.en | 244M | Fast | Better | Most tasks |
| medium / medium.en | 769M | Moderate | High | Quality transcription |
| large-v1/v2/v3 | 1.5GB | Slower | Best | Maximum accuracy |
| large-v3-turbo | 809M | Fast | Excellent | Recommended for accuracy |

Distilled Models (~6x Faster, ~1% WER difference)

| Model | Size | Speed vs Standard | Accuracy | Use Case |
|-------|------|-------------------|----------|----------|
| distil-large-v3 | 756M | ~6.3x faster | 9.7% WER | Default, best balance |
| distil-large-v2 | 756M | ~5.8x faster | 10.1% WER | Fallback |
| distil-medium.en | 394M | ~6.8x faster | 11.1% WER | English-only, resource-constrained |
| distil-small.en | 166M | ~5.6x faster | 12.1% WER | Mobile/edge devices |

.en models are English-only and slightly faster/better for English content.

Setup

Linux / macOS / WSL2

# Run the setup script (creates venv, installs deps, auto-detects GPU)
./setup.sh

Windows (Native)

# Run from PowerShell (auto-installs Python & ffmpeg if missing via winget)
.\setup.ps1

The Windows setup script will:
- Auto-install Python 3.12 via winget if not found
- Auto-install ffmpeg via winget if not found
- Detect NVIDIA GPU and install CUDA-enabled PyTorch
- Create venv and install all dependencies

Requirements:
- Linux/macOS/WSL2: Python 3.10+, ffmpeg
- Windows: Nothing! Setup auto-installs prerequisites via winget

Platform Support

| Platform | Acceleration | Speed | Auto-Install |
|----------|--------------|-------|--------------|
| Windows + NVIDIA GPU | CUDA | ~20x realtime 🚀 | ✅ Full |
| Linux + NVIDIA GPU | CUDA | ~20x realtime 🚀 | Manual prereqs |
| WSL2 + NVIDIA GPU | CUDA | ~20x realtime 🚀 | Manual prereqs |
| macOS Apple Silicon | CPU* | ~3-5x realtime | Manual prereqs |
| macOS Intel | CPU | ~1-2x realtime | Manual prereqs |
| Windows (no GPU) | CPU | ~1x realtime | ✅ Full |
| Linux (no GPU) | CPU | ~1x realtime | Manual prereqs |

*faster-whisper uses CTranslate2 which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

GPU Support (IMPORTANT!)

The setup script auto-detects your GPU and installs PyTorch with CUDA. Always use GPU if available — CPU transcription is extremely slow.

| Hardware | Speed | 9-min video |
|----------|-------|-------------|
| RTX 3070 (GPU) | ~20x realtime | ~27 sec |
| CPU (int8) | ~0.3x realtime | ~30 min |
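
If you drive the library directly rather than through the scripts, a common pattern is to pick the device from `torch.cuda.is_available()`. A sketch, assuming PyTorch is installed (the setup script installs it when a GPU is detected):

```python
# Sketch: choose GPU float16 when CUDA is available, otherwise CPU int8.
# Assumes PyTorch is installed (the setup script installs it when a GPU is detected).
import torch
from faster_whisper import WhisperModel

if torch.cuda.is_available():
    device, compute_type = "cuda", "float16"
else:
    device, compute_type = "cpu", "int8"   # int8 keeps CPU transcription usable

model = WhisperModel("distil-large-v3", device=device, compute_type=compute_type)
```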

If setup didn't detect your GPU, manually install PyTorch with CUDA:

Linux/macOS/WSL2:

# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

Windows:

# For CUDA 12.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
.venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu118

Usage

Linux/macOS/WSL2:

# Basic transcription
./scripts/transcribe audio.mp3

# With specific model
./scripts/transcribe audio.wav --model large-v3-turbo

# With word timestamps
./scripts/transcribe audio.mp3 --word-timestamps

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

# JSON output
./scripts/transcribe audio.mp3 --json

Windows (cmd or PowerShell):

# Basic transcription
.\scripts\transcribe.cmd audio.mp3

# With specific model
.\scripts\transcribe.cmd audio.wav --model large-v3-turbo

# With word timestamps (PowerShell native syntax also works)
.\scripts\transcribe.ps1 audio.mp3 -WordTimestamps

# JSON output
.\scripts\transcribe.cmd audio.mp3 --json

Options

--model, -m        Model name (default: distil-large-v3)
--language, -l     Language code (e.g., en, es, fr - auto-detect if omitted)
--word-timestamps  Include word-level timestamps
--beam-size        Beam search size (default: 5, higher = more accurate but slower)
--vad              Enable voice activity detection (removes silence)
--json, -j         Output as JSON
--output, -o       Save transcript to file
--device           cpu or cuda (auto-detected)
--compute-type     int8, float16, float32 (default: auto-optimized)
--quiet, -q        Suppress progress messages
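
These flags map closely onto parameters of the underlying faster-whisper API. A hedged sketch of roughly equivalent direct calls (the exact mapping inside the script is an assumption):

```python
from faster_whisper import WhisperModel

# Roughly: --model, --device, --compute-type
model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

# Roughly: --language, --beam-size, --word-timestamps, --vad
segments, info = model.transcribe(
    "audio.mp3",
    language="en",          # skip auto-detection
    beam_size=5,
    word_timestamps=True,
    vad_filter=True,        # drop long silences before decoding
)
```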

Examples

# Transcribe YouTube audio (after extraction with yt-dlp)
yt-dlp -x --audio-format mp3 <URL> -o audio.mp3
./scripts/transcribe audio.mp3

# Batch transcription with JSON output
for file in *.mp3; do
  ./scripts/transcribe "$file" --json > "${file%.mp3}.json"
done
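
Note that each iteration of the loop above starts the script (and loads the model) from scratch. For large collections, loading the model once in Python and reusing it is usually much faster — a sketch using the library directly, with illustrative paths:

```python
# Sketch: load the model once and reuse it across many files (paths are illustrative).
import json
from pathlib import Path
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")

for audio in sorted(Path(".").glob("*.mp3")):
    segments, info = model.transcribe(str(audio))
    result = [{"start": s.start, "end": s.end, "text": s.text} for s in segments]
    audio.with_suffix(".json").write_text(json.dumps(result, ensure_ascii=False, indent=2))
    print(f"{audio.name}: {info.duration:.0f}s of audio transcribed")
```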

# High-accuracy transcription with larger beam size
./scripts/transcribe audio.mp3 \
  --model large-v3-turbo --beam-size 10 --word-timestamps

# Fast English-only transcription
./scripts/transcribe audio.mp3 \
  --model distil-medium.en --language en

# Transcribe with VAD (removes silence)
./scripts/transcribe audio.mp3 --vad
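
Word-level timestamps are mainly useful for captions. Below is a hedged sketch that writes a basic SRT file from segment timestamps via the library; the skill's own scripts may format subtitles differently:

```python
# Sketch: write a basic .srt file from segment timestamps (illustrative, not the skill's own output format).
from faster_whisper import WhisperModel

def srt_time(t: float) -> str:
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"

model = WhisperModel("distil-large-v3", device="cuda", compute_type="float16")
# word_timestamps=True also populates seg.words for finer-grained captioning
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)

with open("audio.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(segments, start=1):
        f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n{seg.text.strip()}\n\n")
```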

Common Mistakes

| Mistake | Problem | Solution |
|---------|---------|----------|
| Using CPU when GPU available | 10-20x slower transcription | Check nvidia-smi on Windows/Linux; verify CUDA installation |
| Not specifying language | Wastes time auto-detecting on known content | Use --language en when you know the language |
| Using wrong model | Unnecessary slowness or poor accuracy | Default distil-large-v3 is excellent; only use large-v3 if accuracy issues |
| Ignoring distilled models | Missing 6x speedup with <1% accuracy loss | Try distil-large-v3 before reaching for standard models |
| Forgetting ffmpeg | Setup fails or audio can't be processed | Setup script handles this; manual installs need ffmpeg separately |
| Out of memory errors | Model too large for available VRAM/RAM | Use smaller model or --compute-type int8 |
| Over-engineering beam size | Diminishing returns past beam-size 5-7 | Default 5 is fine; try 10 for critical transcripts |

Performance Notes

  • First run: Downloads model to ~/.cache/huggingface/ (one-time)
  • GPU: Automatically uses CUDA if available (~10-20x faster)
  • Quantization: INT8 used on CPU for ~4x speedup with minimal accuracy loss
  • Memory:
    • distil-large-v3: ~2GB RAM / ~1GB VRAM
    • large-v3-turbo: ~4GB RAM / ~2GB VRAM
    • tiny/base: <1GB RAM

Why faster-whisper?

  • Speed: ~4-6x faster than OpenAI's original Whisper
  • Accuracy: Identical (uses same model weights)
  • Efficiency: Lower memory usage via quantization
  • Production-ready: Stable C++ backend (CTranslate2)
  • Distilled models: ~6x faster with <1% accuracy loss

Troubleshooting

"CUDA not available — using CPU": Install PyTorch with CUDA (see GPU Support above)
Setup fails: Make sure Python 3.10+ is installed
Out of memory: Use smaller model or --compute-type int8
Slow on CPU: Expected — use GPU for practical transcription
Model download fails: Check ~/.cache/huggingface/ permissions (Linux/macOS) or %USERPROFILE%\.cache\huggingface\ (Windows)

Windows-Specific

"winget not found": Install App Installer from Microsoft Store, or install Python/ffmpeg manually
"Python not in PATH after install": Close and reopen your terminal, then run setup.ps1 again
PowerShell execution policy error: Run Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned or use transcribe.cmd instead
nvidia-smi not found but have GPU: Install NVIDIA drivers — the Game Ready or Studio drivers include nvidia-smi


# README.md

faster-whisper

A skill for your Moltbot agent that uses faster-whisper to transcribe audio more quickly.

faster-whisper is superior to OpenAI's original Whisper — it's a CTranslate2 reimplementation that's ~4-6x faster with identical accuracy.

Features

  • ~4-6x faster than OpenAI's original Whisper (same model weights, CTranslate2 backend)
  • ~20x realtime with GPU — transcribe 10 min of audio in ~30 sec
  • Distilled models available (~6x faster again with <1% WER loss)
  • Word-level timestamps
  • Voice activity detection (VAD) — removes silence automatically
  • Multilingual support (99+ languages)
  • Quantization for CPU efficiency
  • GPU acceleration (NVIDIA CUDA)

Installation

Option 1: Install from MoltHub

Via CLI (no installation required):

# Using npx (npm)
npx molthub@latest install faster-whisper

# Using pnpm
pnpm dlx molthub@latest install faster-whisper

# Using bun
bunx molthub@latest install faster-whisper

This downloads and installs the skill into your default skills directory (~/clawd/your-agent/workspace/skills/ or similar).

Via Web UI:
Go to https://clawdhub.com/ThePlasmak/faster-whisper and download the zip.

Option 2: Download from GitHub Releases

  1. Go to Releases
  2. Download the latest faster-whisper-X.X.X.zip
  3. Extract it to your agent's skills folder:
     - Default location: ~/clawd/your-agent/workspace/skills/faster-whisper
     - Or wherever your agent's workspace is configured
# Example
cd ~/clawd/your-agent/workspace/skills/
unzip ~/Downloads/faster-whisper-1.0.1.zip -d faster-whisper

If you're lazy: You can also ask your agent to install it by pasting this repo's link (https://github.com/ThePlasmak/faster-whisper) directly in chat.

Note: The release zip excludes repository files (CHANGELOG, LICENSE, README) and only contains the skill itself — this keeps things lightweight.

Setup

Using with your agent

If you're using your agent, it can guide you through the installation automatically.

What your agent does:

  • Detects your platform (Windows/Linux/macOS/WSL2)
  • Checks for Python, ffmpeg, and GPU drivers
  • Runs the appropriate setup script for you

Standalone CLI Setup

If you want to use the transcription scripts directly without your agent:

Windows (Native):

.\setup.ps1   # Auto-installs Python & ffmpeg via winget if needed

Linux / macOS / WSL2:

./setup.sh

What it installs:
- Python 3.10+ (if missing)
- ffmpeg (audio processing)
- faster-whisper + dependencies
- CUDA support (if NVIDIA GPU detected)

How to Use

With your agent

Just ask in natural language:

"Transcribe this audio file" (with file attached)
"Transcribe interview.mp3 with word timestamps"
"Transcribe this in Spanish"
"Transcribe this and save as JSON"

Your agent will:
- Use the GPU if available
- Handle errors and suggest fixes
- Return formatted results

Standalone CLI

Run the transcription script directly:

Linux / macOS / WSL2:

./scripts/transcribe audio.mp3
./scripts/transcribe audio.mp3 --word-timestamps --json

Windows:

.\scripts\transcribe.cmd audio.mp3
.\scripts\transcribe.cmd audio.mp3 --word-timestamps --json

CLI Options

| Option | Short | Description |
|--------|-------|-------------|
| --model | -m | Model name (default: distil-large-v3) |
| --language | -l | Language code (e.g., en, es, fr) — auto-detects if omitted |
| --word-timestamps | | Include word-level timestamps |
| --beam-size | | Beam search size (default: 5, higher = more accurate but slower) |
| --vad | | Enable voice activity detection (removes silence) |
| --json | -j | Output as JSON |
| --output | -o | Save transcript to file |
| --device | | cpu or cuda (auto-detected) |
| --compute-type | | int8, float16, float32 (auto-optimized) |
| --quiet | -q | Suppress progress messages |

CLI Examples

# Basic transcription
./scripts/transcribe interview.mp3

# Specify language (faster than auto-detect)
./scripts/transcribe podcast.mp3 --language en

# High-accuracy with word timestamps
./scripts/transcribe lecture.wav --model large-v3-turbo --word-timestamps

# JSON output saved to file
./scripts/transcribe meeting.m4a --json --output meeting.json

# Fast English-only transcription
./scripts/transcribe audio.mp3 --model distil-medium.en --language en

# Remove silence with VAD
./scripts/transcribe audio.mp3 --vad

Cross-Platform Support

| Platform | GPU Accel | Auto-Install |
|----------|-----------|--------------|
| Windows + NVIDIA | ✅ CUDA | ✅ via winget |
| Linux + NVIDIA | ✅ CUDA | ❌ manual |
| WSL2 + NVIDIA | ✅ CUDA | ❌ manual |
| macOS (Apple Silicon) | ❌ CPU only | ❌ manual |
| Windows/Linux (no GPU) | ❌ CPU only | Windows: ✅ / Linux: ❌ |

Notes:
- GPU acceleration is ~20-60x faster than CPU
- Apple Silicon Macs run on CPU but are still reasonably fast for practical use (~3-5x realtime vs ~20x with CUDA)
- On platforms without auto-install, your agent can guide you through manual setup

CUDA Requirements (NVIDIA GPUs)

  • Windows: CUDA drivers auto-install with GPU drivers
  • Linux/WSL2: Install CUDA Toolkit separately:
    ```bash
    # Ubuntu/Debian
    sudo apt install nvidia-cuda-toolkit

    # Check CUDA is available
    nvidia-smi
    ```

Default Model

distil-large-v3 (756MB download)

  • ~6x faster than large-v3
  • Within ~1% accuracy of full model
  • Best balance of speed and accuracy

See SKILL.md for full model list and recommendations.

Troubleshooting

CUDA Not Detected

Symptom: Script says "CUDA not available — using CPU (this will be slow!)"

Solutions:

  1. Check NVIDIA drivers are installed:
    ```bash
    # Windows/Linux/WSL2
    nvidia-smi
    ```
    If this fails, install/update your NVIDIA drivers

  2. WSL2 users: Install CUDA drivers on Windows, not inside WSL2

  3. Follow the NVIDIA CUDA on WSL2 Guide

  4. Reinstall PyTorch with CUDA:
    ```bash
    # Linux/macOS/WSL2
    .venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cu121

    # Windows
    .venv\Scripts\pip install torch --index-url https://download.pytorch.org/whl/cu121
    ```

  5. Verify CUDA is working:
    ```bash
    # Linux/macOS/WSL2
    .venv/bin/python -c "import torch; print(torch.cuda.is_available())"

    # Windows
    .venv\Scripts\python -c "import torch; print(torch.cuda.is_available())"
    ```

Out of Memory Errors

Symptom: RuntimeError: CUDA out of memory or OutOfMemoryError

Solutions:

  1. Use a smaller model:
    ```bash
    # Try distil-medium instead of distil-large-v3
    ./scripts/transcribe audio.mp3 --model distil-medium.en
    ```

  2. Use int8 quantization (reduces VRAM by ~4x):
    ```bash
    ./scripts/transcribe audio.mp3 --compute-type int8
    ```

  3. Fall back to CPU for large files:
    ```bash
    ./scripts/transcribe audio.mp3 --device cpu
    ```

  4. Split long audio files into smaller chunks (5-10 min segments)
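
For item 4, one way to split the audio is ffmpeg's segment muxer, then transcribe each piece. A sketch with illustrative file names and a 10-minute chunk length:

```python
# Sketch: split a long recording into ~10-minute chunks with ffmpeg, then transcribe each chunk.
# File names and chunk length are illustrative.
import subprocess
from pathlib import Path
from faster_whisper import WhisperModel

subprocess.run(
    ["ffmpeg", "-i", "long_meeting.mp3", "-f", "segment", "-segment_time", "600",
     "-c", "copy", "chunk_%03d.mp3"],
    check=True,
)

model = WhisperModel("distil-medium.en", device="cuda", compute_type="int8")  # smaller model + int8 to save VRAM
for chunk in sorted(Path(".").glob("chunk_*.mp3")):
    segments, _ = model.transcribe(str(chunk))
    print(f"--- {chunk.name} ---")
    for seg in segments:
        print(seg.text.strip())
```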

VRAM Requirements:
| Model | float16 | int8 |
|-------|---------|------|
| distil-large-v3 | ~2GB | ~1GB |
| large-v3 | ~5GB | ~2GB |
| medium | ~3GB | ~1.5GB |
| small | ~2GB | ~1GB |

ffmpeg Not Found

Symptom: FileNotFoundError: ffmpeg not found

Solutions:

  1. Windows: Re-run setup script (auto-installs via winget):
    ```powershell
    .\setup.ps1
    ```

  2. Linux:
    ```bash
    # Ubuntu/Debian
    sudo apt install ffmpeg

    # Fedora/RHEL
    sudo dnf install ffmpeg

    # Arch
    sudo pacman -S ffmpeg
    ```

  3. macOS:
    ```bash
    brew install ffmpeg
    ```

  4. Verify installation:
    ```bash
    ffmpeg -version
    ```

Model Download Fails

Symptom: HTTPError, ConnectionError, or timeout during first run

Solutions:

  1. Check internet connection

  2. Retry with increased timeout:
    - Models download automatically on first use
    - Download sizes: 75MB (tiny) to 3GB (large-v3)

  3. Use a VPN if Hugging Face is blocked in your region

  4. Manual download:
    ```python
    from faster_whisper import WhisperModel
    model = WhisperModel("distil-large-v3", device="cpu")
    ```

Very Slow Transcription

Symptom: Transcription takes longer than the audio duration

Expected speeds:
- GPU (CUDA): ~20-30x realtime (30 min audio → ~1-2 min)
- Apple Silicon (CPU): ~2-5x realtime (30 min audio → ~6-15 min)
- Intel CPU: ~0.5-1x realtime (30 min audio → 30-60 min)

Solutions:

  1. Ensure GPU is being used:
    ```bash
    # Look for "Loading model: ... (cuda, float16) on NVIDIA ..."
    ./scripts/transcribe audio.mp3
    ```

  2. Use a smaller/distilled model:
    ```bash
    ./scripts/transcribe audio.mp3 --model distil-small.en
    ```

  3. Specify language (skips auto-detection):
    ```bash
    ./scripts/transcribe audio.mp3 --language en
    ```

  4. Reduce beam size:
    ```bash
    ./scripts/transcribe audio.mp3 --beam-size 1
    ```
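
To see which bucket you're in, you can time a run against the audio duration the library reports. A small sketch (file name illustrative; the segments generator must be consumed for the work to actually happen):

```python
# Sketch: measure the realtime factor of a transcription run (file name is illustrative).
import time
from faster_whisper import WhisperModel

model = WhisperModel("distil-large-v3")  # device/compute_type auto-selected by the library
start = time.perf_counter()
segments, info = model.transcribe("audio.mp3", language="en", beam_size=1)
text = " ".join(seg.text for seg in segments)   # consuming the generator does the actual work
elapsed = time.perf_counter() - start

print(f"Audio: {info.duration:.0f}s, transcribed in {elapsed:.0f}s "
      f"({info.duration / elapsed:.1f}x realtime)")
```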

Audio Format Issues

Symptom: Error: Unsupported format or transcription produces garbage

Solutions:

  1. Supported formats: MP3, WAV, M4A, FLAC, OGG, WebM

  2. Most common formats work via ffmpeg

  3. Convert problematic formats:
    ```bash
    ffmpeg -i input.xyz -ar 16000 output.mp3
    ```

  4. Check audio isn't corrupted:
    ```bash
    ffmpeg -i audio.mp3 -f null -
    ```

Python Version Issues

Symptom: SyntaxError or ImportError during setup

Solutions:

  1. Requires Python 3.10 or newer:
    ```bash
    python --version   # or python3 --version
    ```

  2. Windows: Setup script auto-installs Python 3.12 via winget

  3. Linux/macOS: Install Python 3.10+:
    ```bash
    # Ubuntu
    sudo apt install python3.12 python3.12-venv

    # macOS
    brew install python@3.12
    ```

Still Having Issues?

  1. Check the logs: Run without --quiet to see detailed error messages
  2. Ask your agent: Paste the error — it can usually diagnose faster-whisper or installation issues
  3. Open an issue: GitHub Issues
  4. Include:
    - Platform (Windows/Linux/macOS/WSL2)
    - GPU model (if any)
    - Python version
    - Full error message


# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.