Tokenizer-free TTS for multilingual speech — Nepali LoRA fine-tune walkthrough (data → train → compare → clone → design → Gradio).
Abstract. VoxCPM2 is a tokenizer-free text-to-speech system built on an end-to-end diffusion autoregressive stack (LocEnc → TSLM → RALM → LocDiT), trained on large multilingual audio. Nepali is not in the published 30-language list, but Hindi is — strong Devanagari priors make LoRA adaptation practical on ~5–6 hours of OpenSLR read speech, with 48 kHz output via AudioVAE V2.
Key features (upstream)
Nepali is not in VoxCPM2’s official language list, but Hindi is — same Devanagari script, overlapping phonemes, and similar prosody. The model already encodes strong priors for “Nepali-like” sounds.
No VQ audio tokens: diffusion in latent space preserves micro-pitch, breath, and speaker nuance, which is why speaker-similarity (SIM) scores stay high.
Encode at 16 kHz, decode at 48 kHz — training data is resampled to 16 kHz mono; output is studio-rate without an external vocoder.
LoRA sits on top of the base model: better Nepali pronunciation while voice design and RALM-based cloning still work.
~1–2.5% trainable params, smaller checkpoints, less forgetting — a good fit for ~5–6 hours of read speech.
The stages mirror the pipeline in your project notes: text in, 48 kHz WAV out.
Low-rank adapters add a trainable low-rank update to selected weight matrices: W' = W + B×A, where B and A share a small inner dimension r. You freeze the base 2B model and train tens of millions of parameters instead of billions.
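As a rough illustration of the update above, here is a minimal pure-Python sketch of merging a low-rank adapter into one weight matrix. It assumes the usual alpha / r scaling from the LoRA paper, which the formula above omits; whether VoxCPM2's trainer applies exactly this scaling is an assumption. The toy matrix sizes and the helper names (`matmul`, `lora_merge`, `lora_param_count`) are hypothetical.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_merge(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * (B x A) without modifying W.

    W is d_out x d_in, B is d_out x r, A is r x d_in.
    The alpha / r factor is the conventional LoRA scaling (assumed here).
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix: B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)


# Toy example: a 2 x 2 frozen weight with a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [0.0]]   # d_out x r
A = [[1.0, 1.0]]     # r x d_in
W_merged = lora_merge(W, A, B, alpha=2, r=1)

# Per-matrix saving for a hypothetical 2048 x 2048 projection at r = 16.
trainable = lora_param_count(2048, 2048, 16)
full = 2048 * 2048
```

For that hypothetical 2048 × 2048 projection, the adapter trains 65,536 values against 4,194,304 in the frozen matrix. The overall trainable fraction depends on which layers receive adapters, so this per-matrix figure is only indicative.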
enable_lm: true (Nepali text → content plan)
enable_dit: true (Nepali acoustics in LocDiT)
r: 16, alpha: 32, dropout: 0.05
batch_size: 4, grad_accum_steps: 8 → effective 32
num_iters: 5000, lr: 2e-4
Combined OpenSLR-43 and OpenSLR-143 give roughly 5–6 hours of clean read Nepali, above the few-minutes minimum for LoRA, with transcripts.
| Dataset | Speakers | Notes |
|---|---|---|
| OpenSLR-43 | 1 female | line_index.tsv + wavs/ |
| OpenSLR-143 | 1 M + 1 F | FemaleVoice.tsv / MaleVoice.tsv, WAVs alongside |
Training expects a JSONL manifest, one JSON object per line.
All training audio: 16 kHz mono WAV. Use Prepare Manifest.py (or the cluster prepare_manifest.py in your notes) to resample and split train/val.
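The manifest step can be sketched roughly as follows. This is not the project's actual script: the field names `audio` and `text`, the two-column TSV layout (utterance id, transcript), and the function names are all assumptions, so check them against Finetune.md and the real TSV files before use. Resampling to 16 kHz mono is left out here and would need an audio library.

```python
import csv
import json
import random
from pathlib import Path


def read_tsv(tsv_path, wav_dir):
    """Yield manifest entries from a TSV with (utterance id, transcript) columns.

    Assumption: each WAV is named <utterance id>.wav inside wav_dir.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            utt_id, text = row[0].strip(), row[1].strip()
            yield {"audio": str(Path(wav_dir) / f"{utt_id}.wav"), "text": text}


def split_and_write(entries, out_dir, val_fraction=0.05, seed=0):
    """Shuffle entries, hold out a validation slice, and write train/val JSONL."""
    entries = list(entries)
    random.Random(seed).shuffle(entries)
    n_val = max(1, round(len(entries) * val_fraction))
    splits = {"val": entries[:n_val], "train": entries[n_val:]}

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, items in splits.items():
        with open(out_dir / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for item in items:
                # ensure_ascii=False keeps Devanagari readable in the file
                f.write(json.dumps(item, ensure_ascii=False) + "\n")
    return {name: len(items) for name, items in splits.items()}
```

A call such as `split_and_write(read_tsv("line_index.tsv", "wavs"), "manifests")` would then produce `manifests/train.jsonl` and `manifests/val.jsonl`. The fixed seed keeps the split reproducible across reruns.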
This checklist follows Finetune.md sections 6–11.
Scripts in this folder mirror the workflow: compare base vs LoRA, clone a reference speaker, design voices from text. Hear the paired exports on listen.html (same layout as this guide).
test_compare.py loads the original checkpoint and the LoRA-augmented model, runs the same Nepali sentences, and writes paired WAVs (and optional A/B concatenations) under test_outputs/compare/.
Listen for clearer retroflex/aspirated stops, more natural cadence, and better sentence-final intonation on the fine-tuned side.
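The paired-export idea can be sketched with the standard library alone. This is not the contents of test_compare.py: it assumes each model has already produced mono audio as floats in [-1, 1], and the helper names (`to_pcm16`, `write_wav`, `write_ab_pair`) and output file names are hypothetical. Model loading and synthesis are deliberately left out.

```python
import sys
import wave
from array import array
from pathlib import Path


def to_pcm16(floats):
    """Convert float samples in [-1, 1] to 16-bit signed PCM, clipping overshoot."""
    return array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in floats))


def write_wav(path, pcm, sample_rate):
    """Write mono 16-bit PCM samples to a WAV file (WAV data is little-endian)."""
    data = array("h", pcm)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(data.tobytes())


def write_ab_pair(stem, base, lora, sample_rate=48000,
                  out_dir="test_outputs/compare", gap_s=0.5):
    """Write base, LoRA, and a base + silence + LoRA concatenation for one sentence."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    silence = array("h", [0] * int(sample_rate * gap_s))
    paths = {
        "base": out / f"{stem}_base.wav",
        "lora": out / f"{stem}_lora.wav",
        "ab": out / f"{stem}_ab.wav",
    }
    write_wav(paths["base"], base, sample_rate)
    write_wav(paths["lora"], lora, sample_rate)
    write_wav(paths["ab"], base + silence + lora, sample_rate)
    return paths
```

Keeping the base clip first in every A/B file makes listening sessions consistent; for a less biased comparison, the order could be randomized and logged instead.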
test_voice_clone.py passes a reference clip via generate(..., reference_wav_path=...) so RALM locks onto timbre while LoRA keeps Nepali phonetics. Outputs land in test_outputs/voice_clone/ (including a stitched reel with the reference).
Advanced modes in the full notes: basic (reference only), controllable (style in parentheses), ultimate (prompt audio + transcript for continuation).
Voice design prepends a natural-language speaker description in parentheses. LoRA improves Nepali pronunciation; descriptions are still mostly aligned with English/Chinese training, so specificity matters.
Tip: try cfg ≈ 2.0 and a few samples per prompt; the official Gradio demo documents the same pattern.
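The prompt construction described above might be wrapped in a small helper like the one below. The function name `design_prompt` is hypothetical, and the exact spacing between the closing parenthesis and the text is an assumption; compare against the upstream examples before relying on it.

```python
def design_prompt(description, text):
    """Prepend a speaker description in parentheses to the text to synthesize.

    Assumption: the description goes in ASCII parentheses, directly followed
    by the target text with no separating space.
    """
    description = description.strip().strip("()").strip()
    if not description:
        raise ValueError("speaker description must not be empty")
    return f"({description}){text.strip()}"


prompt = design_prompt(
    "A calm middle-aged male narrator, slow pace, warm tone",
    "नमस्ते, तपाईंलाई कस्तो छ?",
)
```

Because the descriptions are mostly aligned with English and Chinese training data, writing them in English with concrete attributes (age, gender, pace, tone) is probably the safer starting point.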
Upstream app.py exposes Text-to-Speech, Voice Design, Voice Cloning, and Ultimate Cloning tabs. On an HPC node, run python app.py --port 8808 and tunnel the port to your laptop.
pip install gradio if needed
ssh -L 8808:<compute-node>:8808 user@login
lora_ft_webui.py is handy on a local GPU; SLURM + YAML is usually better on a cluster.
Condensed from your troubleshooting log.
pip install -e: VoxCPM may pull a different torch constraint. Re-pin your CUDA build of torch (e.g. 2.1.2+cu121) immediately after the editable install.
~/.local/bin not on PATH: export PATH=$HOME/.local/bin:$PATH in SLURM scripts and shell init so CLIs resolve.
Use --config_path (not --config) with train_voxcpm_finetune.py — verify with --help.
Audio referenced by training manifests must be 16 kHz. Setting 48 kHz in the YAML breaks the encoder's expectations.
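A quick pre-flight check for the 16 kHz requirement could look like this sketch. It uses only Python's standard `wave` module, which reads plain PCM WAV files; the function name `find_bad_wavs` is hypothetical, and how you collect the list of paths (from the manifest or a directory listing) is up to you.

```python
import wave


def find_bad_wavs(paths, expected_rate=16000, expected_channels=1):
    """Return (path, sample_rate, channels) for every WAV that does not match.

    An empty list means every file is at the expected rate and channel count.
    """
    bad = []
    for path in paths:
        with wave.open(str(path), "rb") as w:
            rate, channels = w.getframerate(), w.getnchannels()
        if rate != expected_rate or channels != expected_channels:
            bad.append((str(path), rate, channels))
    return bad
```

Running this before submitting a SLURM job is cheaper than discovering a stray 48 kHz or stereo file partway through training.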