Install this specific skill from the multi-skill repository:

```bash
npx skills add grahama1970/agent-skills --skill "tts-horus"
```
# Description

Build and operate the Horus TTS pipeline from cleared audiobooks. Includes dataset prep, WhisperX alignment, XTTS training, voice coloring, and persona inference helpers.
# SKILL.md

```yaml
---
name: tts-horus
description: >
  Build and operate the Horus TTS pipeline from cleared audiobooks.
  Includes dataset prep, WhisperX alignment, XTTS training, voice coloring,
  and persona inference helpers.
allowed-tools: Bash, Read
triggers:
  - horus tts
  - build horus voice
  - voice coloring
  - tts pipeline
metadata:
  short-description: Horus TTS dataset + training + inference pipeline
---
```
# Horus TTS Pipeline

This skill standardizes the end-to-end Horus voice workflow using uvx-style
invocations so that dependencies are isolated per command. It assumes recordings
live under `persona/data/audiobooks/` with per-book `clean/` folders.
## Commands

### dataset

Builds the Horus dataset from *Horus Rising* using faster-whisper segmentation.
```bash
# Step 1: Transcribe and segment
python run/tts/ingest_audiobook.py \
  --audio persona/data/audiobooks/Horus_Rising_The_Horus_Heresy_Book_1-LC_64_22050_stereo/audio.m4b \
  --book-name Horus_Rising \
  --output-dir datasets/horus_voice \
  --max-hours 0

# Step 2: Extract clips (CPU-bound, uses ffmpeg)
python run/tts/extract_clips.py \
  --segments datasets/horus_voice/segments_merged.jsonl \
  --audio persona/data/audiobooks/Horus_Rising_The_Horus_Heresy_Book_1-LC_64_22050_stereo/audio.m4b \
  --output-dir datasets/horus_voice/clips \
  --manifest datasets/horus_voice/full_manifest.jsonl \
  --min-sec 1.5 --max-sec 15.0 --sample-rate 24000 --max-hours 0
```
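The manifest is JSONL with one clip per line. The exact field names are not documented in this skill, so the sanity check below is a sketch that assumes `audio_file` and `duration` keys; adapt it to whatever `extract_clips.py` actually writes.

```python
# Sanity-check the extracted clips manifest.
# NOTE: the "duration" and "audio_file" field names are assumptions, not
# documented by this skill; adjust to match extract_clips.py output.
import json
from pathlib import Path

manifest = Path("datasets/horus_voice/full_manifest.jsonl")
rows = [json.loads(line) for line in manifest.read_text().splitlines() if line.strip()]

total_sec = sum(r.get("duration", 0.0) for r in rows)
missing = [r for r in rows if not Path(r.get("audio_file", "")).exists()]

print(f"{len(rows)} clips, {total_sec / 3600:.1f} h of audio, {len(missing)} missing files")
```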
### align

Runs WhisperX alignment with lexicon overrides.
```bash
python run/tts/align_transcripts.py \
  --manifest datasets/horus_voice/train_manifest.jsonl \
  --output datasets/horus_voice/train_aligned.jsonl \
  --dataset-root datasets/horus_voice \
  --lexicon persona/docs/lexicon_overrides.json \
  --strategy whisperx --device cuda
```
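`align_transcripts.py` wraps WhisperX. Independent of the project code, a minimal word-level alignment call with the `whisperx` library looks roughly like the sketch below; the clip path and transcript are placeholders, and the lexicon-override handling in the real script is not shown.

```python
# Minimal WhisperX word-level alignment sketch (placeholders only; the real
# script batches over the manifest and applies lexicon overrides).
import whisperx

device = "cuda"
audio = whisperx.load_audio("datasets/horus_voice/clips/clip_000001.wav")

# Segments would normally come from the faster-whisper transcription step.
segments = [{"start": 0.0, "end": 3.5, "text": "Lupercal speaks."}]

align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
result = whisperx.align(segments, align_model, metadata, audio, device)

for word in result["segments"][0]["words"]:
    print(word["word"], word.get("start"), word.get("end"))
```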
### train

Fine-tunes XTTS-v2 using GPTTrainer (local A5000 by default).

```bash
python run/tts/train_xtts_coqui.py --config configs/tts/horus_xtts.yaml
```

Important: uses `train_xtts_coqui.py` with `GPTTrainer` (not the generic `Trainer`).
Mixed precision must be disabled (`mixed_precision=False`) to avoid NaN losses.
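For orientation, the Coqui TTS XTTS-v2 fine-tuning recipe that `GPTTrainer` comes from is structured roughly as below. This is a sketch, not the project's `train_xtts_coqui.py`: checkpoint paths, batch sizes, and epochs are placeholders, with the real values living in `configs/tts/horus_xtts.yaml`.

```python
# Rough shape of a GPTTrainer-based XTTS-v2 fine-tune (Coqui recipe style).
# Paths and hyperparameters below are placeholders, not the project config.
from trainer import Trainer, TrainerArgs
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.layers.xtts.trainer.gpt_trainer import (
    GPTArgs, GPTTrainer, GPTTrainerConfig, XttsAudioConfig,
)

dataset_config = BaseDatasetConfig(
    formatter="coqui", path="datasets/horus_voice", meta_file_train="metadata.csv"
)

model_args = GPTArgs(
    max_conditioning_length=132300,      # ~6 s of conditioning audio at 22.05 kHz
    min_conditioning_length=66150,
    gpt_num_audio_tokens=1026,           # see the "1026 vs 8194" gotcha below
    gpt_start_audio_token=1024,
    gpt_stop_audio_token=1025,
    gpt_use_masking_gt_prompt_approach=True,
    gpt_use_perceiver_resampler=True,
    dvae_checkpoint="checkpoints/dvae.pth",     # see the dvae.pth gotcha below
    xtts_checkpoint="checkpoints/model.pth",
    tokenizer_file="checkpoints/vocab.json",
)

config = GPTTrainerConfig(
    output_path="runs/horus_xtts",
    model_args=model_args,
    audio=XttsAudioConfig(sample_rate=22050, dvae_sample_rate=22050, output_sample_rate=24000),
    batch_size=3,
    epochs=200,
    lr=5e-6,                              # official fine-tuning LR, not the 2e-4 default
    mixed_precision=False,                # required: fp16 produces NaN losses here
    optimizer="AdamW",
    optimizer_params={"betas": [0.9, 0.96], "eps": 1e-8, "weight_decay": 1e-2},
)

model = GPTTrainer.init_from_config(config)
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True, eval_split_size=0.02)

trainer = Trainer(
    TrainerArgs(grad_accum_steps=84),     # 3 * 84 = 252 effective batch, per the gotcha below
    config,
    output_path=config.output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```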
### say

CLI synthesis (writes `.artifacts/tts/output.wav` by default).

```bash
python run/tts/say.py "Lupercal speaks."
```
### server

FastAPI server for low-latency synthesis.

```bash
python run/tts/server.py
```
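The server's routes are not documented in this skill. As a purely hypothetical usage sketch, assuming a JSON `/tts` endpoint on port 8000 that returns WAV bytes, a client might look like this; check `run/tts/server.py` for the real route names and payload.

```python
# Hypothetical client for the FastAPI server: the /tts route, JSON payload,
# and port are assumptions, not documented by this skill.
import requests

resp = requests.post(
    "http://localhost:8000/tts",
    json={"text": "Lupercal speaks.", "voice": "horus"},
    timeout=60,
)
resp.raise_for_status()

with open(".artifacts/tts/output.wav", "wb") as f:
    f.write(resp.content)
```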
### color

Voice coloring helper (to be implemented in Task 5).

```bash
python run/tts/color_voice.py --base horus --color warm --alpha 0.4
```
## Notes

- WhisperX is installed in the project `.venv`; for standalone runs, use `uvx whisperx` if needed.
- Golden samples live in `tests/fixtures/tts/golden/` (Git LFS).
- The orchestrate-ready task plan lives at `persona/docs/tasks/0N_voice_coloring.md`.
## Gotchas

| Issue | Solution |
|---|---|
| NaN losses from step 0 | Use `mixed_precision=False`, `precision="float32"` |
| Model size mismatch (1026 vs 8194) | Use `GPTTrainer` with `GPTArgs`, not `XttsConfig` |
| Missing `dvae.pth` | Download from HuggingFace: `wget https://huggingface.co/coqui/XTTS-v2/resolve/main/dvae.pth` |
| Clips too short (rejected) | Merge adjacent segments: `MIN_TARGET=2.0s`, `MAX_GAP=0.5s` (see the sketch below) |
| Clip extraction slow | Normal: uses ffmpeg (CPU-bound); the GPU is only used for training |
| Learning rate | Use `5e-6` (official recipe), NOT the default `2e-4` |
| Batch size | `batch_size * grad_accumulation >= 252` for efficient training |
| LR milestones never reached | The script now auto-calculates milestones based on dataset size |
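The merge rule from the "Clips too short" row can be sketched as a greedy pass over transcription segments. This is illustrative, not the code in `extract_clips.py`, and it assumes segments carry `start`, `end`, and `text` fields.

```python
# Greedy merge of adjacent transcription segments (illustrative sketch only).
MIN_TARGET = 2.0   # seconds: keep merging until a segment is at least this long
MAX_GAP = 0.5      # seconds: never merge across a silence longer than this

def merge_segments(segments):
    merged = []
    for seg in segments:
        if merged:
            prev = merged[-1]
            gap = seg["start"] - prev["end"]
            too_short = (prev["end"] - prev["start"]) < MIN_TARGET
            if too_short and gap <= MAX_GAP:
                prev["end"] = seg["end"]
                prev["text"] = f'{prev["text"]} {seg["text"]}'.strip()
                continue
        merged.append(dict(seg))
    return merged
```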
## Learning Rate Schedule

The training script uses a conservative fine-tuning setup (see the sketch after this list):

- LR schedule: `CosineAnnealingWarmRestarts` (recommended for XTTS)
  - `T_0`: ~20% of total steps (first annealing cycle)
  - `T_mult`: 2 (each subsequent cycle doubles in length)
  - `eta_min`: 1e-7 (minimum LR floor)
- Gradient clipping: 1.0
- Optimizer: AdamW with weight decay 1e-2
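Outside of Coqui's config machinery, the same schedule in plain PyTorch looks roughly like this; the model, step count, and data loop are placeholders, not the project's training script.

```python
# Plain-PyTorch sketch of the schedule above (placeholders for model and data).
import torch

model = torch.nn.Linear(8, 8)           # placeholder for the XTTS GPT module
total_steps = 10_000                     # placeholder; derived from dataset size in practice

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=int(0.2 * total_steps),          # first annealing cycle ~20% of total steps
    T_mult=2,                            # each subsequent cycle doubles in length
    eta_min=1e-7,                        # minimum LR floor
)

for step in range(total_steps):
    loss = model(torch.randn(4, 8)).pow(2).mean()              # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)    # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```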
Why `CosineAnnealingWarmRestarts` over `MultiStepLR`:

- Smoother LR decay (better for fine-tuning)
- Periodic restarts help escape local minima
- Recommended by the Coqui community for XTTS
Double descent is less relevant for fine-tuning because:

1. The model is already trained; we are adapting it, not learning from scratch.
2. Single speaker: we want specialization to Horus's voice.
3. The main risk is memorization, which is mitigated by diverse training data (~18k clips).
Sources:
- AllTalk TTS Finetuning Guide
- Coqui TTS Discussion #3773
## Automated Pipeline

The TTS pipeline can run fully automated from extraction to training:
```bash
# Start extraction, then auto-train when complete
./run/tts/auto_train_after_extraction.sh [extraction_pid]

# Or run the full pipeline from scratch
python run/tts/horus_pipeline.py --skip-align
```
The auto script will:

1. Monitor extraction progress (logs every 60 s)
2. Create train/val manifests (98%/2% split; see the sketch below)
3. Build the Coqui `metadata.csv`
4. Start XTTS training (200 epochs)

Logs: `logs/tts/pipeline_*.log`
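The 98%/2% split over the full manifest can be sketched as below. This is illustrative, not the helper inside `auto_train_after_extraction.sh`, and it assumes the JSONL manifest produced by `extract_clips.py`.

```python
# Illustrative 98%/2% train/val split over the clips manifest.
import random
from pathlib import Path

root = Path("datasets/horus_voice")
lines = [l for l in (root / "full_manifest.jsonl").read_text().splitlines() if l.strip()]

random.seed(42)                        # deterministic split
random.shuffle(lines)
n_val = max(1, int(0.02 * len(lines)))

(root / "val_manifest.jsonl").write_text("\n".join(lines[:n_val]) + "\n")
(root / "train_manifest.jsonl").write_text("\n".join(lines[n_val:]) + "\n")
print(f"train={len(lines) - n_val}  val={n_val}")
```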
## Current Status (2026-01-23)

- Audiobook: Horus Rising (~12 h)
- Extraction: in progress (~9,400 / ~21,975 clips)
- Auto-monitor: running (PID in `logs/tts/auto_pipeline.log`)
- Config: `configs/tts/horus_xtts.yaml` (200 epochs, 5e-6 LR)
- Training: will auto-start after extraction