# tts-train
# Install this skill:
npx skills add grahama1970/agent-skills --skill "tts-train"

This installs the skill from the grahama1970/agent-skills multi-skill repository.

# SKILL.md


---
name: tts-train
description: >
  Build TTS datasets and train Qwen3-TTS voice models from audiobooks or curated clips.
  Use for voice cloning, dataset prep, WhisperX alignment, Qwen3-TTS training, and inference.
allowed-tools: Bash, Read
triggers:
  - tts train
  - train voice model
  - voice cloning
  - voice model training
  - tts dataset
  - build tts dataset
  - align transcripts
  - qwen3 tts training
  - qwen tts training
  - tts fine tune
metadata:
  short-description: End-to-end TTS dataset prep + training workflow for Qwen3-TTS
---


TTS Train Workflow

This skill provides a general, composable workflow for building datasets and training
Qwen3-TTS voice models. It is self-contained, with its own pyproject.toml covering the
Qwen3-TTS and TensorBoard dependencies.

Use the bundled run.sh so the correct environment is selected per step:

  • Project env for ingest/alignment (uses whisperx + faster-whisper).
  • Skill env for Qwen3-TTS training/inference + TensorBoard (avoids pandas conflicts).
.agent/skills/tts-train/run.sh <command> ...

Dataset Build Options

Option A: Audiobook → Segments (fast path)

Use the audiobook-ingest skill for ingestion, then hand off to TTS:

.agent/skills/audiobook-ingest/run.sh handoff-tts "<book_dir_name>"
.agent/skills/audiobook-ingest/run.sh align-tts

Direct ingest (if you want full control):

.agent/skills/tts-train/run.sh ingest \
  persona/data/audiobooks/<book>/audio.m4b \
  <voice_name> \
  datasets/<voice_name>

Option B: Curated Clips + Transcripts

Provide a JSONL file with audio_file (relative to the input dir) and text fields, one JSON object per line.
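
For example (the file names here are illustrative):

{"audio_file": "clips/clip_0001.wav", "text": "The quick brown fox jumps over the lazy dog."}
{"audio_file": "clips/clip_0002.wav", "text": "Hello, and welcome back to the show."}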

.agent/skills/tts-train/run.sh ingest-transcript \
  data/<voice>/audio_raw \
  data/<voice>/transcripts.jsonl \
  datasets/<voice>

Alignment (WhisperX)

.agent/skills/tts-train/run.sh align \
  datasets/<voice>/train_manifest.jsonl \
  datasets/<voice>/train_aligned.jsonl \
  datasets/<voice>

Training (Qwen3-TTS)

Qwen3-TTS provides better quality and more natural speech than the legacy XTTS-v2 path. Training requires manifests in the audio_codes format:

# Convert existing manifest to Qwen3-TTS format with audio_codes
.agent/skills/tts-train/run.sh convert-qwen3 \
  datasets/<voice>/train_manifest.jsonl \
  datasets/<voice>/train_manifest_qwen3.jsonl

# Train Qwen3-TTS model
.agent/skills/tts-train/run.sh train-qwen3 \
  --base-model Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl \
  --out-dir artifacts/tts/<voice>_qwen3 \
  --epochs 5 --batch-size 8 --lr 1e-4

Audio Codes Format: the Qwen3-TTS tokenizer returns audio_codes as a List[torch.LongTensor] with shape [time_steps, 16] (16 quantizers per frame). Extract with enc.audio_codes[0].cpu().tolist().
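
A minimal sketch of that extraction for one clip (convert-qwen3 does this for you; enc is assumed to be the tokenizer output and text the clip's transcript, and the manifest field names are illustrative):

import json

# `enc` is the Qwen3-TTS tokenizer output for one clip (assumption, see above);
# enc.audio_codes is a List[torch.LongTensor] of shape [time_steps, 16].
codes = enc.audio_codes[0].cpu().tolist()

# Append one manifest line per clip (field names are illustrative).
with open("datasets/<voice>/train_manifest_qwen3.jsonl", "a") as f:
    f.write(json.dumps({"text": text, "audio_codes": codes}) + "\n")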

Training (Legacy XTTS-v2)

Note: XTTS-v2 support has been deprecated in favor of Qwen3-TTS, which provides superior voice quality and more natural speech patterns. The skill infrastructure for XTTS remains for backward compatibility but is no longer actively maintained.

For new voice training projects, please use the Qwen3-TTS workflow documented above.

Inference

# Qwen3-TTS inference example
uv run python run/tts/server.py

TensorBoard (Auto)

.agent/skills/tts-train/run.sh tensorboard 6006

Long Runs (Scheduler Skill)

For overnight runs, register a scheduler job so training survives terminal drops:

.agent/skills/scheduler/run.sh register \
  --name "tts-train-<voice>" \
  --interval "12h" \
  --workdir "/home/graham/workspace/experiments/memory" \
  --command ".agent/skills/tts-train/run.sh train-qwen3 --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl --out-dir artifacts/tts/<voice>_qwen3_1.7b --epochs 10 --batch-size 1 --lr 2e-6 --gradient-accumulation-steps 8 --use-lora --lora-r 16 --lora-alpha 32 --use-8bit-adam --gradient-checkpointing | tee logs/tts/<voice>_train.log"

Troubleshooting

Qwen3-TTS 0.6B Model Training Support

Note: The official sft_12hz.py script requires patching to support the 0.6B conversational model (due to a dimension mismatch).

Automated Solution: The cli.py tool automatically applies the required patch (001-fix-sft-optimizations.patch) when you run:

.agent/skills/tts-train/cli.py ensure-repo

This patch includes:

  1. Text Projection Fix: Adds text_projection layer support for 0.6B models.
  2. Training Optimizations: Adds CLI args for max_steps, gradient_accumulation_steps, and weight_decay.
  3. Flash Attention Disable: Forces eager attention to avoid compatibility issues.

If you encounter issues, verify the patch status:

cd third_party/Qwen3-TTS
git status  # Should show modified sft_12hz.py

Qwen3-TTS 1.7B Model Training (Advanced)

Memory Optimization for Large Models: The 1.7B model requires significant VRAM optimization to run on consumer GPUs (24GB VRAM).

Key Configuration:

# 1.7B model with memory optimization
.agent/skills/tts-train/run.sh train-qwen3 \
  --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --data-manifest datasets/<voice>/train_manifest_qwen3.jsonl \
  --out-dir artifacts/tts/<voice>_qwen3_1.7b \
  --epochs 10 --batch-size 1 --lr 2e-6 \
  --gradient-accumulation-steps 8 \
  --use-lora --lora-r 16 --lora-alpha 32 \
  --use-8bit-adam --gradient-checkpointing

Memory Optimization Techniques (sketched in code after this list):

  1. LoRA (PEFT): Reduces trainable parameters by ~90% while maintaining quality
  2. 8-bit AdamW: Cuts optimizer state memory by ~70%
  3. Gradient Checkpointing: Trades compute for memory (saves ~40% VRAM)
  4. Batch Size 1: Essential for 24GB VRAM systems with 12GB system overhead
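
A rough sketch of what these flags map to, assuming an HF-style model object and the peft/bitsandbytes libraries the options imply (target_modules is illustrative):

import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model

# `model` is the loaded base model (assumption; see the Inference section below).
# 1. LoRA adapter: ~90% fewer trainable parameters.
lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)

# 2. 8-bit AdamW: optimizer states stored in 8 bits.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-6)

# 3. Gradient checkpointing: recompute activations in the backward pass.
model.gradient_checkpointing_enable()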

Expected VRAM Usage: ~18GB total (vs ~32GB without optimizations)

Training Results: 9 epochs completed successfully, generating native-compatible checkpoints that preserve speaker identity for high-quality narration.

Model Type: 1.7B models are "custom_voice" models (not voice_clone); use the generate_custom_voice() method for inference with trained speaker IDs.

Checkpoint Validation: Each epoch saves complete checkpoints with merged LoRA weights, ensuring native from_pretrained() compatibility without surgical splicing.
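
The merge step itself can be done with peft's merge_and_unload(); a sketch assuming model is the trained PeftModel (the checkpoint path is illustrative):

# Bake the LoRA deltas into the base weights so plain from_pretrained() works.
merged = model.merge_and_unload()
merged.save_pretrained("artifacts/tts/<voice>_qwen3_1.7b/checkpoint-epoch-9")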

Hyperparameter Optimization (1.7B Advanced)

Critical for 1.7B Success: Use Bayesian optimization with web research to find optimal hyperparameters before full training.

Web-Enhanced Bayesian Tuning

# 1.7B model with web research for optimal parameters
.agent/skills/tts-train/run.sh tune-1.7b-bayesian \
  --n_trials 15 \
  --n_smoke_steps 300 \
  --model_size 1.7b \
  --use_web_research \
  --dataset horus

# Monitor progress with Optuna dashboard
optuna-dashboard sqlite:///runs/horus/bayesian_tuning_1.7b/optuna_study.db

Web Research Integration:

  • Searches recent papers and community best practices
  • Validates parameters against known failure modes
  • Provides intelligent parameter initialization

Optimized Search Space for 1.7B (see the Optuna sketch after this list):

  • Learning Rate: 1e-6 to 5e-5 (conservative for large models)
  • Batch Size: Fixed at 1 (memory constraint)
  • Gradient Accumulation: 8-32 (for effective batch size)
  • LoRA Config: r=[16,32], alpha=[32,64] (based on our successful training)
  • Warmup Steps: 500-2000 (prevents early instability)
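
A minimal Optuna sketch of this search space; run_smoke_training is a hypothetical stand-in for a short smoke run (e.g. 300 steps, per --n_smoke_steps above) that returns validation loss:

import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space mirroring the 1.7B ranges above; batch size stays fixed at 1.
    lr = trial.suggest_float("lr", 1e-6, 5e-5, log=True)
    grad_accum = trial.suggest_int("gradient_accumulation_steps", 8, 32)
    lora_r = trial.suggest_categorical("lora_r", [16, 32])
    lora_alpha = trial.suggest_categorical("lora_alpha", [32, 64])
    warmup = trial.suggest_int("warmup_steps", 500, 2000)
    # run_smoke_training is a hypothetical helper standing in for a short run.
    return run_smoke_training(lr=lr, gradient_accumulation_steps=grad_accum,
                              lora_r=lora_r, lora_alpha=lora_alpha,
                              warmup_steps=warmup, batch_size=1)

study = optuna.create_study(direction="minimize",
                            storage="sqlite:///runs/horus/bayesian_tuning_1.7b/optuna_study.db")
study.optimize(objective, n_trials=15)
print(study.best_params)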

Memory-Aware Tuning: Automatically applies 1.7B optimizations (LoRA, 8-bit Adam, gradient checkpointing) during tuning phase.

Traditional Bayesian Tuning (No Web Research)

# Standard Bayesian optimization
python .agent/skills/tts-train/tune_qwen3_1.7b_bayesian.py \
  --n_trials 10 \
  --model_size 1.7b \
  --dataset horus

Expected Results: Finds optimal configuration in 10-15 trials vs 50+ grid search experiments.

Best Configuration Application

After tuning completes, use the optimal parameters for full training:

# Load best config from tuning
BEST_CONFIG=$(jq -c '.hyperparameters' artifacts/tts/qwen3_1.7b_bayesian/best_config.json)

# Apply to full training
.agent/skills/tts-train/run.sh train-qwen3 \
  --base-model Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --data-manifest datasets/horus/train_manifest_qwen3.jsonl \
  --out-dir artifacts/tts/horus_qwen3_1.7b_optimized \
  --lr $(echo "$BEST_CONFIG" | jq -r '.lr') \
  --gradient-accumulation-steps $(echo "$BEST_CONFIG" | jq -r '.gradient_accumulation_steps') \
  --weight-decay $(echo "$BEST_CONFIG" | jq -r '.weight_decay') \
  --lora-r $(echo "$BEST_CONFIG" | jq -r '.lora_r') \
  --lora-alpha $(echo "$BEST_CONFIG" | jq -r '.lora_alpha') \
  --warmup-steps $(echo "$BEST_CONFIG" | jq -r '.warmup_steps') \
  --use-lora --use-8bit-adam --gradient-checkpointing

Why This Matters:

CRITICAL WARNING: 1.7B model training WILL FAIL without proper hyperparameter tuning due to:

  • ❌ CUDA out of memory errors (improper batch/memory settings)
  • ❌ Poor voice quality (suboptimal learning rates)
  • ❌ Training instability (incorrect LoRA configurations)
  • ❌ 3-5x longer convergence time

With Tuning: Achieves optimal results in 10-15 trials vs 50+ manual experiments with 15-20% better final loss.

Inference and Model Usage

Qwen3-TTS Model Loading

1.7B Custom Voice Models (trained with this workflow):

import torch
from qwen_tts.inference.qwen3_tts_model import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "artifacts/tts/<voice>_qwen3_1.7b/checkpoint-epoch-9",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate speech with trained speaker
audio_outputs, sample_rate = model.generate_custom_voice(
    text="Your text here",
    speaker="<voice_name>",  # Use the speaker name from training
    non_streaming_mode=True,
)
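
To write the result to disk, assuming audio_outputs[0] is a 1-D waveform array (soundfile is one option; it is not part of this skill's API):

import soundfile as sf

# Assumption: audio_outputs[0] is a 1-D float waveform at sample_rate.
sf.write("output.wav", audio_outputs[0], sample_rate)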

Memory Requirements: 1.7B models require ~18GB VRAM for inference. Use device_map="auto" for automatic memory management.

Model Types:

  • Custom Voice: Trained models use generate_custom_voice() with speaker IDs
  • Voice Clone: Base models support generate_voice_clone() with reference audio (sketched below)
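
A hedged sketch of the voice-clone path on a base model; the parameter names below are assumptions, so check the Qwen3-TTS repo for the exact signature:

# Hypothetical kwargs -- verify against generate_voice_clone()'s real signature.
audio_outputs, sample_rate = model.generate_voice_clone(
    text="Your text here",
    ref_audio="reference.wav",  # short clip of the target voice
)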

CUDA Library Error: libcudnn_ops_infer.so.8

If faster-whisper/WhisperX crashes with:

Could not load library libcudnn_ops_infer.so.8: cannot open shared object file

Fix: Set LD_LIBRARY_PATH before running:

export LD_LIBRARY_PATH=$(python -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))')

Or use the Docker container with cuDNN pre-installed:

docker run --rm -it --gpus all -v $(pwd):/workspace memory-tts:cu121 python ...

Too Many Short Segments (Low Clip Yield)

Whisper's VAD often produces very short segments (1-1.5s) that get rejected by the
--min-sec 1.5 threshold. Merge adjacent segments before extraction:

# Merge segments with gap < 0.5s until the 2-10s target duration.
# `segments` is a list of dicts with 'start', 'end', 'text' (Whisper output).
MIN_TARGET, MAX_TARGET, MAX_GAP = 2.0, 10.0, 0.5
merged = []
current = None
for seg in segments:
    if current is None:
        current = seg.copy()
        continue
    gap = seg['start'] - current['end']
    combined_dur = seg['end'] - current['start']
    if gap <= MAX_GAP and combined_dur <= MAX_TARGET and (current['end'] - current['start']) < MIN_TARGET:
        current['end'] = seg['end']
        current['text'] += ' ' + seg['text']
    else:
        merged.append(current)
        current = seg.copy()
if current:
    merged.append(current)

This typically reduces 44k segments → 22k with mean 2s duration, greatly improving yield.
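
To feed the merged list into ingest via --segments-jsonl (next section), write it back out as JSONL (a minimal sketch; the output path is illustrative):

import json

with open("existing_segments.jsonl", "w") as f:
    for seg in merged:
        f.write(json.dumps(seg) + "\n")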

Using Pre-existing Segments

Skip transcription if you already have segments:

.agent/skills/tts-train/run.sh ingest \
  <audio.m4b> <voice_name> <output_dir> \
  --segments-jsonl existing_segments.jsonl

If you want a concise dataset report, emit a .batch_state.json and run batch-report:

uv run python - <<'PY'
import json
from pathlib import Path

root = Path("datasets/<voice>")
manifest = root / "train_manifest.jsonl"
rejections = root / "rejections.jsonl"
total = sum(1 for _ in manifest.open()) if manifest.exists() else 0
rejected = sum(1 for _ in rejections.open()) if rejections.exists() else 0
state = {
  "name": "tts-dataset",
  "description": "TTS dataset build summary",
  "total": total + rejected,
  "successful": total,
  "failed": rejected,
}
(root / ".batch_state.json").write_text(json.dumps(state, indent=2))
PY

uv run python .agent/skills/batch-report/report.py summary datasets/<voice>
