# lattifai / omnicaptions-transcribe

```bash
# Install this skill:
npx skills add lattifai/omni-captions-skills --skill "omnicaptions-transcribe"
```

This installs the single `omnicaptions-transcribe` skill from the multi-skill repository.

# Description

Use when transcribing audio/video to text with timestamps, speaker labels, and chapters. Supports YouTube URLs and local files. Produces structured markdown output.

# SKILL.md


---
name: omnicaptions-transcribe
description: Use when transcribing audio/video to text with timestamps, speaker labels, and chapters. Supports YouTube URLs and local files. Produces structured markdown output.
allowed-tools: Bash(omnicaptions:*)
---


# Gemini Transcription

Transcribe audio/video using the Google Gemini API, with structured markdown output.

## YouTube Video Workflow

**Important:** Check for existing captions before transcribing:

1. Check captions: `yt-dlp --list-subs "URL"`
2. Has captions → use `/omnicaptions:download` to get the existing captions (better quality)
3. No captions → transcribe directly from the URL (don't download first!); see the sketch below

**Confirm with user:** Before transcribing, ask whether they want to check for existing captions first.
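
A minimal sketch of this decision flow, assuming a standard `yt-dlp` install (the URL is a placeholder):

```bash
URL="https://www.youtube.com/watch?v=VIDEO_ID"

# 1. List any existing subtitles/captions for the video
yt-dlp --list-subs "$URL"

# 2a. Captions listed -> prefer /omnicaptions:download to fetch them (better quality)
# 2b. Nothing listed  -> transcribe the URL directly; no download needed
omnicaptions transcribe "$URL"
```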

URL & Local File Support

Gemini natively supports YouTube URLs - no need to download, just pass the URL directly:

```bash
# YouTube URL (recommended, no download needed)
omnicaptions transcribe "https://www.youtube.com/watch?v=VIDEO_ID"

# Local files
omnicaptions transcribe video.mp4
```

**Note:** Output defaults to the current directory unless the user specifies `-o`; see the example below.
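
A minimal illustration of redirecting output (the `transcripts/` path is just a placeholder):

```bash
# Write the transcript into a separate directory instead of ./
omnicaptions transcribe "https://www.youtube.com/watch?v=VIDEO_ID" -o transcripts/
```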

## When to Use

- Video URLs - YouTube, direct video links (Gemini native support)
- Transcribing podcasts, interviews, lectures
- Need a verbatim transcript with timestamps and speaker labels
- Want auto-generated chapters from content
- Mixed-language audio (code-switching preserved)

## When NOT to Use

- Video has existing captions - use `/omnicaptions:download` to get the existing captions first
- Need real-time streaming transcription (use Whisper)
- Audio >2 hours (Gemini upload limit)
- Want translation instead of transcription

## Quick Reference

| Method | Description |
|---|---|
| `transcribe(path)` | Transcribe file or URL (sync) |
| `translate(in, out, lang)` | Translate captions |
| `write(text, path)` | Save text to file |

## Setup

```bash
pip install https://github.com/lattifai/omni-captions-skills/raw/main/packages/lattifai_captions-0.1.0.tar.gz
pip install https://github.com/lattifai/omni-captions-skills/raw/main/packages/omnicaptions-0.1.0.tar.gz
```
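
A quick, generic shell check to confirm the install put the `omnicaptions` CLI on PATH (plain shell, not part of the skill itself):

```bash
# Verify the CLI is reachable after installation
command -v omnicaptions || echo "omnicaptions not found; check your pip environment"
```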

## API Key

Priority: `GEMINI_API_KEY` env → `.env` file → `~/.config/omnicaptions/config.json`

If not set, ask the user: "Please enter your Gemini API key (get one from https://aistudio.google.com/apikey):"

Then run with `-k <key>`. The key will be saved to the config file automatically.
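
A minimal sketch of the two documented ways to supply the key (the key value is a placeholder):

```bash
# Option 1: environment variable (checked first)
export GEMINI_API_KEY="your-key-here"
omnicaptions transcribe video.mp4

# Option 2: pass the key once with -k; it is saved to the config file for later runs
omnicaptions transcribe video.mp4 -k "your-key-here"
```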

## CLI Usage

**IMPORTANT:** The CLI requires a subcommand (`transcribe`, `translate`, `convert`).

```bash
# Transcribe (auto-output to same directory)
omnicaptions transcribe video.mp4              # → ./video_GeminiUnd.md
omnicaptions transcribe "https://youtu.be/abc" # → ./abc_GeminiUnd.md

# Specify output file or directory
omnicaptions transcribe video.mp4 -o output/   # → output/video_GeminiUnd.md
omnicaptions transcribe video.mp4 -o my.md     # → my.md

# Options
omnicaptions transcribe -m gemini-3-pro-preview video.mp4
omnicaptions transcribe -l zh video.mp4  # Force Chinese
```

| Option | Description |
|---|---|
| `-k, --api-key` | Gemini API key (auto-prompted if missing) |
| `-o, --output` | Output file or directory (default: auto) |
| `-m, --model` | Model (default: gemini-3-flash-preview) |
| `-l, --language` | Force language (zh, en, ja) |
| `-t, --translate LANG` | Translate to language (one-step) |
| `--bilingual` | Bilingual output (with `-t`) |
| `-v, --verbose` | Verbose output |

## Bilingual Captions (Optional)

If the user requests bilingual output, add `-t <lang> --bilingual`:

```bash
omnicaptions transcribe video.mp4 -t zh --bilingual
```

For precise timing, use the separate workflow: transcribe → LaiCut → translate (see Related Skills).

## Output Format

```markdown
## Table of Contents
* [00:00:00] Introduction
* [00:02:15] Main Topic

## [00:00:00] Introduction

**Host:** Welcome to the show. [00:00:01]

**Guest:** Thanks for having me. [00:00:05]

[Applause] [00:00:08]
```

Key features:
- `## [HH:MM:SS] Title` chapter headers
- `**Speaker:**` labels (auto-detected)
- `[HH:MM:SS]` timestamp at paragraph end
- `[Event]` for non-speech (laughter, music)
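
Because chapters follow the fixed `## [HH:MM:SS] Title` pattern, downstream tooling can pull them out with plain text tools; a small sketch, assuming the default output filename:

```bash
# List chapter headers and their start times from a finished transcript
grep -E '^## \[[0-9]{2}:[0-9]{2}:[0-9]{2}\]' video_GeminiUnd.md
```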

## Common Mistakes

| Mistake | Fix |
|---|---|
| No API key error | Use `-k YOUR_KEY` or follow the prompt |
| Empty response | Check the file format (mp3/mp4/wav/m4a supported) |
| Upload timeout | File too large (>2GB); split first |
| Wrong language | Use `-l en` to force the language |

## Related Skills

| Skill | Use When |
|---|---|
| `/omnicaptions:convert` | Convert output to SRT/VTT/ASS |
| `/omnicaptions:translate` | Translate (Gemini API or Claude native) |
| `/omnicaptions:download` | Download video/audio first |

## Workflow Examples

```bash
# Basic transcription
omnicaptions transcribe video.mp4
# → video_GeminiUnd.md

# Precise timing needed: transcribe → LaiCut align → convert
omnicaptions transcribe video.mp4
omnicaptions LaiCut video.mp4 video_GeminiUnd.md
# → video_GeminiUnd_LaiCut.json
omnicaptions convert video_GeminiUnd_LaiCut.json -o video_GeminiUnd_LaiCut.srt
```

**Note:** For translation, use `/omnicaptions:translate` (default: Claude; optional: Gemini API).

# Supported AI Coding Agents

This skill is compatible with the SKILL.md standard and works with all major AI coding agents.

Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.