VoxCPM2

Tokenizer-free TTS for multilingual speech — Nepali LoRA fine-tune walkthrough (data → train → compare → clone → design → Gradio).

Nepali TTS project · CWRU Pioneer HPC · April 2026 · Official VoxCPM2 demo page

Abstract. VoxCPM2 is a tokenizer-free text-to-speech system built on an end-to-end diffusion autoregressive stack (LocEnc → TSLM → RALM → LocDiT) and trained on large-scale multilingual audio. Nepali is not in the published 30-language list, but Hindi is, so the model already has strong priors for Devanagari text. That makes LoRA adaptation practical on ~5–6 hours of OpenSLR read speech, with 48 kHz output via AudioVAE V2.

Why this stack for Nepali

Nepali is not in VoxCPM2’s official language list, but Hindi is — same Devanagari script, overlapping phonemes, and similar prosody. The model already encodes strong priors for “Nepali-like” sounds.

Tokenizer-free: continuous latents

No VQ audio tokens: diffusion in latent space preserves micro-pitch, breath, and speaker nuance, which is why speaker-similarity (SIM) scores stay high.

48 kHz out: AudioVAE V2

Encode at 16 kHz, decode at 48 kHz. Training data is resampled to 16 kHz mono; output is studio-rate without an external vocoder.

Voice extras: design + cloning

LoRA sits on top of the base model: better Nepali pronunciation while voice design and RALM-based cloning still work.

HPC-ready: LoRA, not full SFT

~1–2.5% trainable params, smaller checkpoints, and less forgetting: a good fit for ~5–6 hours of read speech.

How audio is produced (four stages)

This mirrors the pipeline in your project notes: text in, 48 kHz WAV out.

Text input (Devanagari / Nepali)

  • Stage 1, LocEnc: local text embeddings
  • Stage 2, TSLM (MiniCPM-4, ~2B): semantic speech plan; the voice design prefix enters here
  • Stage 3, RALM (reference-aware): speaker identity from reference audio, merged with the plan
  • Stage 4, LocDiT (diffusion in latent space): flow-matching style generation
  • AudioVAE V2 decoder: 16 kHz latents → 48 kHz waveform

48 kHz WAV output

LoRA in this project

Low-rank adapters update a slice of weights: W' = W + B×A. You freeze the base 2B model and train tens of millions of parameters instead of billions.
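The update W' = W + B×A can be made concrete with a toy example. The sketch below uses plain Python lists and tiny dimensions. The alpha / r scaling factor and the zero initialisation of B follow the original LoRA formulation and are assumptions here, since this guide does not state how the VoxCPM2 training code applies them.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def lora_merge(w, b, a, alpha, r):
    """Return W' = W + (alpha / r) * (B x A) without modifying W."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4   # toy sizes; the guide uses r=16, alpha=32

w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # A: r x d_in, trainable
b = [[0.0] * r for _ in range(d_out)]                                   # B: d_out x r, starts at zero

# With B = 0 the adapter is a no-op, so training starts from the base model.
assert lora_merge(w, b, a, alpha, r) == w

# Pretend one training step changed a single entry of B: only row 0 of W' moves.
b[0][0] = 1.0
w_merged = lora_merge(w, b, a, alpha, r)
changed_rows = [i for i in range(d_out) if w_merged[i] != w[i]]
```

Because only B and A receive gradients, the saved adapter holds just those two small matrices per adapted layer, which is why LoRA checkpoints stay small.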

Your YAML-style settings

  • enable_lm: true — Nepali text → content plan
  • enable_dit: true — Nepali acoustics in LocDiT
  • r: 16, alpha: 32, dropout: 0.05

Training snapshot

  • batch_size: 4, grad_accum_steps: 8 → effective 32
  • num_iters: 5000, lr: 2e-4
  • Single H100 80GB; ~25–30 GB VRAM typical for LoRA

With r = 16, each adapted 4096×4096 layer gains 2 × 4096 × 16 = 131,072 (~131K) trainable params (B: 4096×r, A: r×4096).
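The arithmetic behind these settings is easy to check. This is a minimal sketch using only the numbers quoted above; the function name lora_params is hypothetical and not part of the VoxCPM training code.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters added by one adapter: B (d_out x r) plus A (r x d_in)."""
    return d_out * r + r * d_in

r, alpha = 16, 32
per_layer = lora_params(4096, 4096, r)   # 131072 trainable params
full_layer = 4096 * 4096                 # 16777216 frozen weights in the same layer
fraction = per_layer / full_layer        # 0.0078125, under 1% of that layer

batch_size, grad_accum_steps = 4, 8
effective_batch = batch_size * grad_accum_steps   # 32 samples per optimizer step

print(per_layer, f"{fraction:.2%}", effective_batch)
```

The per-layer fraction is below the overall ~1–2.5% figure because it counts a single square projection; the total depends on how many layers in the LM and DiT receive adapters.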

Datasets and manifests

Combined, OpenSLR-43 and OpenSLR-143 give roughly 5–6 hours of clean, transcribed Nepali read speech, well above the few minutes that LoRA needs as a minimum.

Dataset       Speakers    Notes
OpenSLR-43    1 female    line_index.tsv + wavs/
OpenSLR-143   1 M + 1 F   FemaleVoice.tsv / MaleVoice.tsv, WAVs alongside

Training expects JSONL lines like:

{"audio": "/absolute/path/file.wav", "text": "…"}

All training audio: 16 kHz mono WAV. Use Prepare Manifest.py (or the cluster prepare_manifest.py in your notes) to resample and split train/val.
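For orientation, here is a minimal sketch of the manifest-writing half of that step. It is not the project's prepare_manifest.py: the function name, the train.jsonl / val.jsonl file names, and the val_fraction default are assumptions for illustration, and it expects audio that has already been resampled to 16 kHz mono.

```python
import json
import random
from pathlib import Path

def write_manifests(pairs, out_dir, val_fraction=0.05, seed=0):
    """Shuffle (wav_path, text) pairs and write train.jsonl / val.jsonl."""
    rows = [{"audio": str(Path(wav).resolve()), "text": text.strip()}
            for wav, text in pairs if text.strip()]     # skip empty transcripts
    random.Random(seed).shuffle(rows)                   # deterministic split
    n_val = max(1, int(len(rows) * val_fraction)) if len(rows) > 1 else 0
    splits = {"val.jsonl": rows[:n_val], "train.jsonl": rows[n_val:]}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, subset in splits.items():
        with open(out_dir / name, "w", encoding="utf-8") as f:
            for row in subset:
                # ensure_ascii=False keeps Devanagari readable in the file
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(splits["train.jsonl"]), len(splits["val.jsonl"])
```

Paths are resolved to absolute form so the manifest still works when a SLURM job starts from a different working directory.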

End-to-end pipeline

The pipeline follows Finetune.md sections 6–11.

After training: evaluation and creative modes

Scripts in this folder mirror the workflow: compare base vs LoRA, clone a reference speaker, design voices from text. Hear the paired exports on listen.html (same layout as this guide).

test_compare.py loads the original checkpoint and the LoRA-augmented model, runs the same Nepali sentences, and writes paired WAVs (and optional A/B concatenations) under test_outputs/compare/.
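The A/B concatenation step can be sketched with the standard library alone. This is an illustration, not the code in test_compare.py: the function name and the gap length are assumptions, and it only handles integer PCM WAV files of 16 bits or more whose formats match.

```python
import wave

def concat_ab(base_wav, lora_wav, out_wav, gap_seconds=0.5):
    """Write base audio, a silent gap, then LoRA audio into one WAV file."""
    with wave.open(str(base_wav), "rb") as a, wave.open(str(lora_wav), "rb") as b:
        fmt_a = (a.getnchannels(), a.getsampwidth(), a.getframerate())
        fmt_b = (b.getnchannels(), b.getsampwidth(), b.getframerate())
        if fmt_a != fmt_b:
            raise ValueError(f"format mismatch: {fmt_a} vs {fmt_b}")
        channels, width, rate = fmt_a
        n_a, n_b = a.getnframes(), b.getnframes()
        frames_a = a.readframes(n_a)
        frames_b = b.readframes(n_b)
    gap_frames = int(rate * gap_seconds)
    # Zero bytes are silence for signed PCM (16-bit and wider).
    silence = b"\x00" * (gap_frames * channels * width)
    with wave.open(str(out_wav), "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(frames_a + silence + frames_b)
    return n_a + gap_frames + n_b   # total frames written
```

Keeping the base clip first in every pair makes listening sessions easier to score, since the order never has to be looked up.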

Listen for clearer retroflex/aspirated stops, more natural cadence, and better sentence-final intonation on the fine-tuned side.

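Sample sentences: नमस्ते। ("Hello.") · नेपाल एक सुन्दर देश हो जहाँ हिमालय, पहाड र तराई क्षेत्र छन्। ("Nepal is a beautiful country with the Himalayas, hills, and the Terai region.")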

test_voice_clone.py passes a reference clip via generate(..., reference_wav_path=...) so RALM locks onto timbre while LoRA keeps Nepali phonetics. Outputs land in test_outputs/voice_clone/ (including a stitched reel with the reference).

Advanced modes in the full notes: basic (reference only), controllable (style in parentheses), ultimate (prompt audio + transcript for continuation).

Sample sentence: नमस्ते, मेरो नाम सारा हो। ("Hello, my name is Sara.")

Voice design prepends a natural-language speaker description in parentheses. LoRA improves Nepali pronunciation, but descriptions are still interpreted through mostly English/Chinese training data, so specific descriptions work best.

(A young Nepali woman, gentle and sweet voice, speaking softly) नमस्ते, म तपाईंलाई नेपालको बारेमा बताउन चाहन्छु।
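Building such prompts by hand is error-prone when generating several samples per description. The helper below is a small sketch of the parenthesised-prefix convention shown above; the name design_prompt is hypothetical, and the resulting string is what would be passed as the text input.

```python
def design_prompt(description, text):
    """Prefix the text with a speaker description wrapped in parentheses."""
    # Strip any parentheses the caller already added so they are not doubled.
    description = description.strip().strip("()").strip()
    if not description:
        return text.strip()   # no description: plain TTS input
    return f"({description}) {text.strip()}"

prompt = design_prompt(
    "A young Nepali woman, gentle and sweet voice, speaking softly",
    "नमस्ते, म तपाईंलाई नेपालको बारेमा बताउन चाहन्छु।",
)
```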

Tip: try cfg ≈ 2.0 and a few samples per prompt; the official Gradio demo documents the same pattern.

Web demo (Gradio)

Upstream app.py exposes Text-to-Speech, Voice Design, Voice Cloning, and Ultimate Cloning tabs. On an HPC node, run python app.py --port 8808 and tunnel the port to your laptop.

Quick launch pattern

  • Interactive GPU session + venv + CUDA modules
  • pip install gradio if needed
  • ssh -L 8808:<compute-node>:8808 user@login

LoRA WebUI (optional)

lora_ft_webui.py is handy on a local GPU; SLURM + YAML is usually better on a cluster.

Troubleshooting highlights

Condensed from your troubleshooting log.

PyTorch version swapped after pip install -e

Installing VoxCPM in editable mode may pull in a different torch version. Re-pin your CUDA build (e.g. 2.1.2+cu121) immediately after the editable install.
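A quick way to notice the swap is to compare the installed version against the pin before submitting a job. This sketch uses only the standard library; check_pin is a hypothetical helper, and the version string is the example pin from above, not a requirement of VoxCPM.

```python
from importlib import metadata

def check_pin(package, expected):
    """Return True/False if the installed version matches, None if not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return installed == expected

status = check_pin("torch", "2.1.2+cu121")   # example pin; use your own build string
if status is None:
    print("torch is not installed in this environment")
elif status:
    print("torch pin intact")
else:
    print("torch was replaced; reinstall the pinned CUDA build")
```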

~/.local/bin not on PATH

Add export PATH=$HOME/.local/bin:$PATH to SLURM scripts and shell init files so that user-installed CLIs resolve.
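To confirm the setting took effect inside a job, a check like the following can run at the top of a Python entry point. It is a minimal sketch; dir_on_path is a hypothetical helper name.

```python
import os
from pathlib import Path

def dir_on_path(directory, path_env=None):
    """Return True if `directory` is one of the entries in PATH."""
    path_env = os.environ.get("PATH", "") if path_env is None else path_env
    target = Path(directory).expanduser()
    return any(Path(entry).expanduser() == target
               for entry in path_env.split(os.pathsep) if entry)

local_bin = Path.home() / ".local" / "bin"
if not dir_on_path(local_bin):
    print(f"add to your SLURM script: export PATH={local_bin}:$PATH")
```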

Wrong training flag

Use --config_path (not --config) with train_voxcpm_finetune.py — verify with --help.

Sample rate mistakes

Audio referenced by the training manifests must be 16 kHz mono. Setting 48 kHz in the YAML breaks the encoder's expectations.
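A manifest can be audited for stray sample rates before a training job is queued. The sketch below reads WAV headers with the standard library; audit_manifest is a hypothetical helper, and it assumes integer PCM WAV files plus the "audio" key from the JSONL format shown earlier.

```python
import json
import wave

def audit_manifest(manifest_path, expected_rate=16000, expected_channels=1):
    """Return (path, rate, channels) for every WAV that is not 16 kHz mono."""
    bad = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue                      # tolerate blank lines
            path = json.loads(line)["audio"]
            with wave.open(path, "rb") as w:  # header read only, no decoding
                rate, channels = w.getframerate(), w.getnchannels()
            if (rate, channels) != (expected_rate, expected_channels):
                bad.append((path, rate, channels))
    return bad
```

An empty return value means every file matches; anything else lists the files to resample before training.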