VoxCPM2

Tokenizer-free TTS for multilingual speech — Nepali LoRA fine-tune walkthrough (data → train → compare → clone → design → Gradio).

Nepali TTS project · CWRU Pioneer HPC · April 2026 · Official VoxCPM2 demo page

Abstract. VoxCPM2 is a tokenizer-free text-to-speech system built on an end-to-end diffusion autoregressive stack (LocEnc → TSLM → RALM → LocDiT), trained on large multilingual audio. Nepali is not in the published 30-language list, but Hindi is — strong Devanagari priors make LoRA adaptation practical on ~5–6 hours of OpenSLR read speech, with 48 kHz output via AudioVAE V2.

Key features (upstream)

🌍 Multilingual — synthesize in supported languages without explicit language tags
🎨 Voice design — describe a novel voice in natural language; no reference clip required
🎛️ Controllable cloning — clone from a short clip; optional style steering while keeping timbre
🔊 48 kHz output — 16 kHz reference / encoder path; decoder super-resolves to studio rate

Contents (this guide)

Why VoxCPM2 for Nepali
Architecture: LocEnc → TSLM → RALM → LocDiT
LoRA configuration
Datasets & manifests
End-to-end pipeline
After training: compare, clone, design
Web demo (Gradio)
Troubleshooting


Why this stack for Nepali

Nepali is not in VoxCPM2’s official language list, but Hindi is — same Devanagari script, overlapping phonemes, and similar prosody. The model already encodes strong priors for “Nepali-like” sounds.

Tokenizer-free

Continuous latents

No VQ audio tokens: diffusion in latent space preserves micro-pitch, breath, and speaker nuance, which is why speaker-similarity (SIM) scores stay high.

48 kHz out

AudioVAE V2

Encode at 16 kHz, decode at 48 kHz — training data is resampled to 16 kHz mono; output is studio-rate without an external vocoder.

Voice extras

Design + cloning

LoRA sits on top of the base model: better Nepali pronunciation while voice design and RALM-based cloning still work.

HPC-ready

LoRA, not full SFT

~1–2.5% trainable params, smaller checkpoints, less forgetting — a good fit for ~5–6 hours of read speech.

How audio is produced (four stages)

The four stages (LocEnc → TSLM → RALM → LocDiT) mirror the pipeline in your project notes: text in, 48 kHz WAV out.

LoRA in this project

Low-rank adapters update a slice of weights: W' = W + B×A. You freeze the base 2B model and train tens of millions of parameters instead of billions.

Your YAML-style settings

enable_lm: true — Nepali text → content plan
enable_dit: true — Nepali acoustics in LocDiT
r: 16, alpha: 32, dropout: 0.05
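To make the r and alpha settings concrete, here is a minimal pure-Python sketch of a merged LoRA weight. It assumes the conventional alpha / r scaling from the original LoRA formulation, which the W' = W + B×A shorthand leaves out; whether VoxCPM's trainer applies exactly this scaling is an assumption, and the helper names `matmul` and `lora_merge` are illustrative only.

```python
# Illustrative only: tiny matrices stand in for a real weight layer.
# Assumption: the adapter is scaled by alpha / r (standard LoRA convention).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_merge(w, b, a, r, alpha):
    """Return W' = W + (alpha / r) * (B @ A); W itself is left untouched."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# 2x2 frozen weight with a rank-1 adapter. r=1, alpha=2 gives the same
# scale factor (2.0) as r=16, alpha=32.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # shape: d_out x r
A = [[0.5, 0.5]]     # shape: r x d_in
merged = lora_merge(W, B, A, r=1, alpha=2)
```

Under this convention, r: 16 with alpha: 32 doubles the adapter's contribution relative to an unscaled B×A. In standard LoRA, B also starts at zero, so the adapted model is identical to the base model at step 0.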

Training snapshot

batch_size: 4, grad_accum_steps: 8 → effective 32
num_iters: 5000, lr: 2e-4
Single H100 80GB; ~25–30 GB VRAM typical for LoRA
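The batch arithmetic above can be checked with a small helper. This is a sketch under a stated assumption: it is not confirmed here whether num_iters counts optimizer steps or micro-batches, so the hypothetical `samples_seen` function covers both readings.

```python
def samples_seen(batch_size, grad_accum_steps, num_iters, iters_are_optimizer_steps=True):
    """Utterances processed over a run.

    Assumption: if num_iters counts optimizer steps, each iteration consumes
    batch_size * grad_accum_steps utterances; if it counts micro-batches,
    each iteration consumes only batch_size.
    """
    per_iter = batch_size * grad_accum_steps if iters_are_optimizer_steps else batch_size
    return per_iter * num_iters

effective_batch = 4 * 8                               # 32, as in the snapshot
as_optimizer_steps = samples_seen(4, 8, 5000)         # 160000
as_micro_batches = samples_seen(4, 8, 5000, False)    # 20000
```

Dividing either figure by the number of lines in your train manifest gives a rough epoch count, which is a useful sanity check against overfitting on a few hours of audio.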

Low-rank size r (illustrative parameter count per layer)

r = 16 → 131,072 (~131K) trainable params per 4096×4096 layer (B: 4096×r, A: r×4096)
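The per-layer count follows directly from the shapes of B and A. A minimal sketch for trying other values of r; the 4096×4096 layer size is the illustrative figure from above, not a measured dimension of the model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted layer: B is d_out x r, A is r x d_in."""
    return d_out * r + r * d_in

full_layer = 4096 * 4096
for r in (4, 8, 16, 32, 64):
    n = lora_param_count(4096, 4096, r)
    print(f"r={r:>2}: {n:>7,} trainable params ({n / full_layer:.2%} of the full layer)")
```

The count grows linearly in r, so doubling r doubles both adapter capacity and checkpoint size for that layer.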

Datasets and manifests

Combined OpenSLR-43 and OpenSLR-143 give roughly 5–6 hours of clean read Nepali — above the few-minutes minimum for LoRA, with transcripts.

Dataset	Speakers	Notes
OpenSLR-43	1 female	`line_index.tsv` + `wavs/`
OpenSLR-143	1 male + 1 female	`FemaleVoice.tsv` / `MaleVoice.tsv`, WAVs alongside

Training expects JSONL lines like:

{"audio": "/absolute/path/file.wav", "text": "…"}

All training audio: 16 kHz mono WAV. Use Prepare Manifest.py (or the cluster prepare_manifest.py in your notes) to resample and split train/val.
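For orientation, the manifest step can be sketched with the standard library alone. This is not the project's script: it assumes each TSV row starts with the WAV stem and ends with the transcript, it does not resample (the audio must already be 16 kHz mono), and the function names and the train.jsonl / val.jsonl output names are hypothetical.

```python
import csv
import json
import random
from pathlib import Path

def read_tsv_pairs(tsv_path, wav_dir):
    """Yield {"audio", "text"} records from a TSV index.

    Assumption: column 0 is the WAV stem and the last column is the
    transcript; adjust the indices if your TSV is laid out differently.
    """
    wav_dir = Path(wav_dir).resolve()  # manifests expect absolute paths
    with open(tsv_path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2 or not row[-1].strip():
                continue  # skip blank or transcript-less lines
            yield {"audio": str(wav_dir / f"{row[0].strip()}.wav"),
                   "text": row[-1].strip()}

def write_split(records, out_dir, val_fraction=0.05, seed=0):
    """Shuffle deterministically, then write train.jsonl and val.jsonl."""
    records = list(records)
    random.Random(seed).shuffle(records)
    n_val = max(1, int(len(records) * val_fraction)) if records else 0
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    splits = {"val.jsonl": records[:n_val], "train.jsonl": records[n_val:]}
    for name, rows in splits.items():
        with open(out_dir / name, "w", encoding="utf-8") as fh:
            for rec in rows:
                # ensure_ascii=False keeps Devanagari readable in the file
                fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records) - n_val, n_val
```

A fixed seed keeps the train/val split reproducible across reruns, which matters when comparing checkpoints on the same validation lines.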

End-to-end pipeline

This pipeline follows Finetune.md sections 6–11.

After training: evaluation and creative modes

Scripts in this folder mirror the workflow: compare base vs LoRA, clone a reference speaker, design voices from text. Hear the paired exports on listen.html (same layout as this guide).

test_compare.py loads the original checkpoint and the LoRA-augmented model, runs the same Nepali sentences, and writes paired WAVs (and optional A/B concatenations) under test_outputs/compare/.

Listen for clearer retroflex/aspirated stops, more natural cadence, and better sentence-final intonation on the fine-tuned side.

नमस्ते। · नेपाल एक सुन्दर देश हो जहाँ हिमालय, पहाड र तराई क्षेत्र छन्।

test_voice_clone.py passes a reference clip via generate(..., reference_wav_path=...) so RALM locks onto timbre while LoRA keeps Nepali phonetics. Outputs land in test_outputs/voice_clone/ (including a stitched reel with the reference).

Advanced modes in the full notes: basic (reference only), controllable (style in parentheses), ultimate (prompt audio + transcript for continuation).

नमस्ते, मेरो नाम सारा हो।

Voice design prepends a natural-language speaker description in parentheses. LoRA improves Nepali pronunciation; descriptions are still mostly aligned with English/Chinese training, so specificity matters.

(A young Nepali woman, gentle and sweet voice, speaking softly) नमस्ते, म तपाईंलाई नेपालको बारेमा बताउन चाहन्छु।

Tip: try cfg ≈ 2.0 and a few samples per prompt; the official Gradio demo documents the same pattern.
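The parenthesised-description format is easy to get subtly wrong (doubled brackets, missing space), so a small helper can build the prompt string. This sketch only formats text; `with_description` is a hypothetical name, and passing the result to the model is left to your own script.

```python
def with_description(description, text):
    """Prepend a speaker or style description in parentheses to the text."""
    description = description.strip().strip("()").strip()
    if not description:
        return text  # no description: plain TTS prompt
    return f"({description}) {text}"

prompt = with_description(
    "A young Nepali woman, gentle and sweet voice, speaking softly",
    "नमस्ते, म तपाईंलाई नेपालको बारेमा बताउन चाहन्छु।",
)
```

Keeping descriptions in one place also makes it simple to generate several samples per prompt and compare them side by side.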

Web demo (Gradio)

Upstream app.py exposes Text-to-Speech, Voice Design, Voice Cloning, and Ultimate Cloning tabs. On an HPC node, run python app.py --port 8808 and tunnel the port to your laptop.

Quick launch pattern

Interactive GPU session + venv + CUDA modules
pip install gradio if needed
ssh -L 8808:<compute-node>:8808 user@login
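Once the tunnel is up, a quick way to confirm it before opening the browser is to test the local end of the forwarded port. A minimal standard-library sketch; `port_is_open` is a hypothetical helper, and 8808 is simply the port used above.

```python
import socket

def port_is_open(host="127.0.0.1", port=8808, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
```

Run `port_is_open()` on your laptop. A False result usually means the tunnel is down or nothing is listening on the forwarded port; a True result only shows the local end accepts connections, so check the app on the compute node if the page still does not load.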

LoRA WebUI (optional)

lora_ft_webui.py is handy on a local GPU; SLURM + YAML is usually better on a cluster.

Troubleshooting highlights

Condensed from your troubleshooting log.

PyTorch version swapped after pip install -e

Installing VoxCPM in editable mode may pull in a different torch version to satisfy its dependency constraints. Re-pin your CUDA build (e.g. 2.1.2+cu121) immediately after the editable install.
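One way to catch the swap early is to compare the installed version against your pin right after the install step. A minimal sketch using the standard library's importlib.metadata; the helper names are hypothetical, and the pin string in the usage note is just the example from above.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def pin_intact(package, expected):
    """True only if the installed version matches the expected pin exactly."""
    return installed_version(package) == expected
```

For example, `pin_intact("torch", "2.1.2+cu121")` at the top of a SLURM job script fails fast instead of surfacing as a CUDA mismatch mid-run.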

~/.local/bin not on PATH

Add export PATH=$HOME/.local/bin:$PATH to SLURM scripts and shell init so CLIs resolve.

Wrong training flag

Use --config_path (not --config) with train_voxcpm_finetune.py — verify with --help.

Sample rate mistakes

Audio listed in training manifests must be 16 kHz mono. Setting 48 kHz in the YAML breaks the encoder's expectations; 48 kHz applies only to the decoder output.
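The 16 kHz requirement can be verified before a job is submitted by reading WAV headers with the standard library. A minimal sketch with hypothetical helper names; note that the `wave` module handles plain PCM WAV files and may raise `wave.Error` on other encodings.

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, channels) read from a PCM WAV header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getframerate(), wf.getnchannels()

def is_training_ready(path, rate=16000, channels=1):
    """True if the file matches the expected training format (16 kHz mono)."""
    return wav_format(path) == (rate, channels)
```

Looping this over the "audio" field of each manifest line catches stray 48 kHz or stereo files cheaply, since only the header is read.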

Built for the TTS_Nepali workspace · Content summarized from Finetune.md · WAV playback lives on listen.html (open from Downloads/TTS_Nepali/ so ../compare/ paths resolve). For hosting, copy WAVs into the site bundle (see the samples page).