Refactor high-complexity React components in Dify frontend. Use when `pnpm analyze-component...
npx skills add grahama1970/agent-skills --skill "tts-train"
Install specific skill from multi-skill repository
# Description
>
# SKILL.md
name: tts-train
description: >
Build TTS datasets and train Qwen3-TTS voice models from audiobooks or curated clips.
Use for voice cloning, dataset prep, WhisperX alignment, Qwen3-TTS training, and inference.
allowed-tools: Bash, Read
triggers:
- tts train
- train voice model
- voice cloning
- voice model training
- tts dataset
- build tts dataset
- align transcripts
- qwen3 tts training
- qwen tts training
- tts fine tune
metadata:
short-description: End-to-end TTS dataset prep + training workflow for Qwen3-TTS
TTS Train Workflow
This skill provides a general, composable workflow for building datasets and training
Qwen3-TTS voice models. It has its own pyproject.toml (self-contained) for Qwen3-TTS + TensorBoard.
Use the bundled run.sh so the correct environment is selected per step:
- Project env for ingest/alignment (uses
whisperx+faster-whisper). - Skill env for Qwen3-TTS training/inference + TensorBoard (avoids pandas conflicts).
.agent/skills/tts-train/run.sh <command> ...
Dataset Build Options
Option A: Audiobook β Segments (fast path)
Use the audiobook-ingest skill for ingestion, then hand off to TTS:
.agent/skills/audiobook-ingest/run.sh handoff-tts "<book_dir_name>"
.agent/skills/audiobook-ingest/run.sh align-tts
Direct ingest (if you want full control):
.agent/skills/tts-train/run.sh ingest \
persona/data/audiobooks/<book>/audio.m4b \
<voice_name> \
datasets/<voice_name>
Option B: Curated Clips + Transcripts
Provide a JSONL file with audio_file (relative to input dir) and text.
.agent/skills/tts-train/run.sh ingest-transcript \
data/<voice>/audio_raw \
data/<voice>/transcripts.jsonl \
datasets/<voice>
Alignment (WhisperX)
.agent/skills/tts-train/run.sh align \
datasets/<voice>/train_manifest.jsonl \
datasets/<voice>/train_aligned.jsonl \
datasets/<voice>
Training (Qwen3-TTS)
Qwen3-TTS provides better quality and more natural speech. Requires proper audio_codes format:
# Convert existing manifest to Qwen3-TTS format with audio_codes
.agent/skills/tts-train/run.sh convert-qwen3 \
datasets/<voice>/train_manifest.jsonl \
datasets/<voice>/train_manifest_qwen3.jsonl
# Train Qwen3-TTS model
.agent/skills/tts-train/run.sh train-qwen3 \
--base-model Qwen/Qwen3-TTS-12Hz-0.6B-Base \
--data-manifest datasets/<voice>/train_manifest_qwen3.jsonl \
--out-dir artifacts/tts/<voice>_qwen3 \
--epochs 5 --batch-size 8 --lr 1e-4
Audio Codes Format: Qwen3-TTS tokenizer returns audio_codes as List[torch.LongTensor] with shape [time_steps, 16_quantizers]. Extract with enc.audio_codes[0].cpu().tolist().
Training (Legacy XTTS-v2)
Note: XTTS-v2 support has been deprecated in favor of Qwen3-TTS, which provides superior voice quality and more natural speech patterns. The skill infrastructure for XTTS remains for backward compatibility but is no longer actively maintained.
For new voice training projects, please use the Qwen3-TTS workflow documented above.
Inference
# Qwen3-TTS inference example
uv run python run/tts/server.py
TensorBoard (Auto)
.agent/skills/tts-train/run.sh tensorboard 6006
Long Runs (Scheduler Skill)
For overnight runs, register a scheduler job so training survives terminal drops:
.agent/skills/scheduler/run.sh register \
--name "tts-train-<voice>" \
--interval "12h" \
--workdir "/home/graham/workspace/experiments/memory" \
--command ".agent/skills/tts-train/run.sh train-qwen3 --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl --out-dir artifacts/tts/<voice>_qwen3_1.7b --epochs 10 --batch-size 1 --lr 2e-6 --gradient-accumulation-steps 8 --use-lora --lora-r 16 --lora-alpha 32 --use-8bit-adam --gradient-checkpointing | tee logs/tts/<voice>_train.log"
Troubleshooting
Qwen3-TTS 0.6B Model Training Support
Note: The official sft_12hz.py script requires patching to support the 0.6B conversational model (due to dimension mismatch).
Automated Solution: The cli.py tool automatically applies the required patch (001-fix-sft-optimizations.patch) when you run:
.agent/skills/tts-train/cli.py ensure-repo
This patch includes:
- Text Projection Fix: Adds
text_projectionlayer support for 0.6B models. - Training Optimizations: Adds CLI args for
max_steps,gradient_accumulation_steps, andweight_decay. - Flash Attention Disable: Forces
eagerattention to avoid compatibility issues.
If you encounter issues, verify the patch status:
cd third_party/Qwen3-TTS
git status # Should show modified sft_12hz.py
Qwen3-TTS 1.7B Model Training (Advanced)
Memory Optimization for Large Models: The 1.7B model requires significant VRAM optimization to run on consumer GPUs (24GB VRAM).
Key Configuration:
# 1.7B model with memory optimization
.agent/skills/tts-train/run.sh train-qwen3 \
--base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--data-manifest datasets/<voice>/train_manifest_qwen3.jsonl \
--out-dir artifacts/tts/<voice>_qwen3_1.7b \
--epochs 10 --batch-size 1 --lr 2e-6 \
--gradient-accumulation-steps 8 \
--use-lora --lora-r 16 --lora-alpha 32 \
--use-8bit-adam --gradient-checkpointing
Memory Optimization Techniques:
- LoRA (PEFT): Reduces trainable parameters by ~90% while maintaining quality
- 8-bit AdamW: Cuts optimizer state memory by ~70%
- Gradient Checkpointing: Trades compute for memory (saves ~40% VRAM)
- Batch Size 1: Essential for 24GB VRAM systems with 12GB system overhead
Expected VRAM Usage: ~18GB total (vs ~32GB without optimizations)
Training Results: 9 epochs completed successfully, generating native-compatible checkpoints that preserve speaker identity for high-quality narration.
Model Type: 1.7B models are "custom_voice" models (not voice_clone), use generate_custom_voice() method for inference with trained speaker IDs.
Checkpoint Validation: Each epoch saves complete checkpoints with merged LoRA weights, ensuring native from_pretrained() compatibility without surgical splicing.
Hyperparameter Optimization (1.7B Advanced)
Critical for 1.7B Success: Use Bayesian optimization with web research to find optimal hyperparameters before full training.
Web-Enhanced Bayesian Tuning
# 1.7B model with web research for optimal parameters
.agent/skills/tts-train/run.sh tune-1.7b-bayesian \
--n_trials 15 \
--n_smoke_steps 300 \
--model_size 1.7b \
--use_web_research \
--dataset horus
# Monitor progress with Optuna dashboard
optuna-dashboard sqlite:///runs/horus/bayesian_tuning_1.7b/optuna_study.db
Web Research Integration:
- Searches recent papers and community best practices
- Validates parameters against known failure modes
- Provides intelligent parameter initialization
Optimized Search Space for 1.7B:
- Learning Rate: 1e-6 to 5e-5 (conservative for large models)
- Batch Size: Fixed at 1 (memory constraint)
- Gradient Accumulation: 8-32 (for effective batch size)
- LoRA Config: r=[16,32], alpha=[32,64] (based on our successful training)
- Warmup Steps: 500-2000 (prevents early instability)
Memory-Aware Tuning: Automatically applies 1.7B optimizations (LoRA, 8-bit Adam, gradient checkpointing) during tuning phase.
Traditional Bayesian Tuning (No Web Research)
# Standard Bayesian optimization
python .agent/skills/tts-train/tune_qwen3_1.7b_bayesian.py \
--n_trials 10 \
--model_size 1.7b \
--dataset horus
Expected Results: Finds optimal configuration in 10-15 trials vs 50+ grid search experiments.
Best Configuration Application
After tuning completes, use the optimal parameters for full training:
# Load best config from tuning
BEST_CONFIG=$(cat artifacts/tts/qwen3_1.7b_bayesian/best_config.json | jq -r '.hyperparameters')
# Apply to full training
python .agent/skills/tts-train/run.sh train-qwen3 \
--base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--data-manifest datasets/horus/train_manifest_qwen3.jsonl \
--out-dir artifacts/tts/horus_qwen3_1.7b_optimized \
--lr $(echo $BEST_CONFIG | jq -r '.lr') \
--gradient-accumulation-steps $(echo $BEST_CONFIG | jq -r '.gradient_accumulation_steps') \
--weight-decay $(echo $BEST_CONFIG | jq -r '.weight_decay') \
--lora-r $(echo $BEST_CONFIG | jq -r '.lora_r') \
--lora-alpha $(echo $BEST_CONFIG | jq -r '.lora_alpha') \
--warmup-steps $(echo $BEST_CONFIG | jq -r '.warmup_steps') \
--use-lora --use-8bit-adam --gradient-checkpointing
Why This Matters:
CRITICAL WARNING: 1.7B model training WILL FAIL without proper hyperparameter tuning due to:
- β CUDA out of memory errors (improper batch/memory settings)
- β Poor voice quality (suboptimal learning rates)
- β Training instability (incorrect LoRA configurations)
- β 3-5x longer convergence time
With Tuning: Achieves optimal results in 10-15 trials vs 50+ manual experiments with 15-20% better final loss.
Inference and Model Usage
Qwen3-TTS Model Loading
1.7B Custom Voice Models (trained with this workflow):
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel
model = Qwen3TTSModel.from_pretrained(
"artifacts/tts/<voice>_qwen3_1.7b/checkpoint-epoch-9",
torch_dtype=torch.bfloat16,
device_map="auto"
)
# Generate speech with trained speaker
audio_outputs, sample_rate = model.generate_custom_voice(
text="Your text here",
speaker="<voice_name>", # Use the speaker name from training
non_streaming_mode=True
)
Memory Requirements: 1.7B models require ~18GB VRAM for inference. Use device_map="auto" for automatic memory management.
Model Types:
- Custom Voice: Trained models use
generate_custom_voice()with speaker IDs - Voice Clone: Base models support
generate_voice_clone()with reference audio
CUDA Library Error: libcudnn_ops_infer.so.8
If faster-whisper/WhisperX crashes with:
Could not load library libcudnn_ops_infer.so.8: cannot open shared object file
Fix: Set LD_LIBRARY_PATH before running:
export LD_LIBRARY_PATH=$(python -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')
Or use the Docker container with cuDNN pre-installed:
docker run --rm -it --gpus all -v $(pwd):/workspace memory-tts:cu121 python ...
Too Many Short Segments (Low Clip Yield)
Whisper's VAD often produces very short segments (1-1.5s) that get rejected by the
--min-sec 1.5 threshold. Merge adjacent segments before extraction:
# Merge segments with gap < 0.5s until 2-10s target duration
MIN_TARGET, MAX_TARGET, MAX_GAP = 2.0, 10.0, 0.5
merged = []
current = None
for seg in segments:
if current is None:
current = seg.copy()
continue
gap = seg['start'] - current['end']
combined_dur = seg['end'] - current['start']
if gap <= MAX_GAP and combined_dur <= MAX_TARGET and (current['end'] - current['start']) < MIN_TARGET:
current['end'] = seg['end']
current['text'] += ' ' + seg['text']
else:
merged.append(current)
current = seg.copy()
if current:
merged.append(current)
This typically reduces 44k segments β 22k with mean 2s duration, greatly improving yield.
Using Pre-existing Segments
Skip transcription if you already have segments:
.agent/skills/tts-train/run.sh ingest \
<audio.m4b> <voice_name> <output_dir> \
--segments-jsonl existing_segments.jsonl
If you want a concise dataset report, emit a .batch_state.json and run batch-report:
uv run python - <<'PY'
import json
from pathlib import Path
root = Path("datasets/<voice>")
manifest = root / "train_manifest.jsonl"
rejections = root / "rejections.jsonl"
total = sum(1 for _ in manifest.open()) if manifest.exists() else 0
rejected = sum(1 for _ in rejections.open()) if rejections.exists() else 0
state = {
"name": "tts-dataset",
"description": "TTS dataset build summary",
"total": total + rejected,
"successful": total,
"failed": rejected,
}
(root / ".batch_state.json").write_text(json.dumps(state, indent=2))
PY
uv run python .agent/skills/batch-report/report.py summary datasets/<voice>
# Supported AI Coding Agents
This skill is compatible with the SKILL.md standard and works with all major AI coding agents:
Learn more about the SKILL.md standard and how to use these skills with your preferred AI coding agent.