# VoxCPM2 Nepali TTS — Complete Project Notes

> **Cluster:** CWRU Pioneer HPC  
> **GPU:** 1× NVIDIA H100 80GB (originally planned 2× L40S — see §5)  
> **Method:** LoRA fine-tuning  
> **Date:** April 2026

---

## Table of Contents

1. [Why VoxCPM2?](#1-why-voxcpm2)
2. [How VoxCPM2 Works — Architecture Deep Dive](#2-architecture)
3. [What Is LoRA and Why We Chose It](#3-lora)
4. [Datasets — OpenSLR-43 & OpenSLR-143](#4-datasets)
5. [Why H100 Instead of L40S](#5-gpu)
6. [Step-by-Step: Environment Setup](#6-env)
7. [Step-by-Step: Model & Data Download](#7-download)
8. [Step-by-Step: Data Preparation](#8-data-prep)
9. [Step-by-Step: YAML Configuration](#9-yaml)
10. [Step-by-Step: SLURM Script](#10-slurm)
11. [Step-by-Step: Submit & Monitor Training](#11-train)
12. [Troubleshooting Log](#12-troubleshooting)
13. [Testing — Single Model Inference](#13-test-single)
14. [Testing — Original vs Fine-tuned Comparison](#14-test-compare)
15. [Voice Cloning with Fine-tuned Model](#15-voice-cloning)
16. [Voice Design with Fine-tuned Model](#16-voice-design)
17. [Web Demo (Gradio)](#17-web-demo)
18. [Project Directory Structure](#18-directory)
19. [References & Citation](#19-references)

---

## 1. Why VoxCPM2?

We needed a TTS model to generate **high-quality Nepali speech**. Nepali is not among VoxCPM2's 30 official languages, but **Hindi is** — and Hindi shares the Devanagari script, many phonemes, and prosodic patterns with Nepali. This means the pretrained model already has strong acoustic priors for Nepali-like sounds.

### Comparison with alternatives

| Model | Params | Languages | Voice Design | Voice Cloning | Sample Rate | Open-Source | License |
|-------|--------|-----------|:------------:|:-------------:|:-----------:|:-----------:|---------|
| **VoxCPM2** | 2B | 30 | ✅ | ✅ (3 modes) | 48kHz | ✅ | Apache-2.0 |
| CosyVoice2 | 0.5B | 7 | ❌ | ✅ | 22kHz | ✅ | Apache-2.0 |
| F5-TTS | 0.3B | 2 | ❌ | ✅ | 24kHz | ✅ | MIT |
| Bark | 0.9B | 13+ | ❌ | Limited | 24kHz | ✅ | MIT |
| ElevenLabs | — | 30+ | ❌ | ✅ | 44.1kHz | ❌ | Proprietary |
| Qwen3-TTS | 1.7B | 30+ | ❌ | ✅ | — | ✅ | Apache-2.0 |
| FishAudio S2 | 4B | 30+ | ❌ | ✅ | 48kHz | ✅ | Apache-2.0 |

### 6 key reasons we picked VoxCPM2

1. **Tokenizer-free architecture** — operates directly in continuous latent space, preserving subtle acoustic details (breath, pitch micro-variations, emotion) that discrete-token systems lose during quantization
2. **Hindi in training data** — trained on 2M+ hours including Hindi, giving it strong Devanagari phoneme coverage; Nepali shares ~80% of consonants and vowel inventory with Hindi
3. **Official LoRA support** — first-party fine-tuning scripts with documented YAML configs, not a community hack
4. **48kHz native output** — AudioVAE V2's asymmetric design (16kHz encode → 48kHz decode) means we get studio-quality output without external upsamplers
5. **Voice Design + Controllable Cloning** — unique features: create voices from text descriptions, clone with style control. These work with our LoRA adapter loaded on top
6. **Apache-2.0 license** — free for commercial use, no usage caps

### What VoxCPM2 excels at (benchmarks)

On the [MiniMax-MLS benchmark](https://platform.minimax.io/docs/guides/speech-evaluate):
- **85.4% SIM** on English voice similarity (vs ElevenLabs' 61.3%)
- **1.1% WER** on Chinese (vs ElevenLabs' 16%)
- State-of-the-art speaker similarity across 24 languages

On [Seed-TTS-eval](https://github.com/OpenBMB/VoxCPM):
- **75.3% SIM** on English, **79.5% SIM** on Chinese
- Competitive WER across the board

> Source: [Medium analysis by Ewan Mak](https://medium.com/@tentenco/voxcpm2-the-open-source-voice-model-that-beats-elevenlabs-on-similarity-but-the-full-benchmark-ffe408b50b87)

---

## 2. How VoxCPM2 Works — Architecture Deep Dive

VoxCPM2 uses a **tokenizer-free, diffusion autoregressive** paradigm. Unlike traditional TTS systems (ElevenLabs, Bark, CosyVoice) that convert audio → discrete tokens → predict next token → reconstruct audio, VoxCPM2 **never quantizes audio into tokens**. It works entirely in continuous latent space.

### The 4-Stage Pipeline

```
Text Input
    │
    ▼
┌─────────────────────────────────────────────────┐
│  Stage 1: LocEnc (Local Encoder)                │
│  Converts text → local text embeddings          │
│  Character/phoneme-level representation         │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│  Stage 2: TSLM (Text-Side Language Model)       │
│  MiniCPM-4 backbone (2B params)                 │
│  Predicts semantic content plan from text       │
│  Token rate: 6.25 Hz                            │
│  Max sequence length: 8192 tokens               │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│  Stage 3: RALM (Reference-Aware Language Model) │
│  Conditions on reference audio (for cloning)    │
│  Merges semantic plan + speaker identity        │
│  This is where Voice Design descriptions enter  │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│  Stage 4: LocDiT (Local Diffusion Transformer)  │
│  Generates continuous audio latents via         │
│  diffusion in AudioVAE V2's latent space        │
│  Flow-matching-based (inspired by CosyVoice)    │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────┐
│  AudioVAE V2 Decoder                            │
│  Asymmetric: encodes at 16kHz, decodes at 48kHz │
│  Built-in super-resolution, no extra upsampler  │
└──────────────────────┬──────────────────────────┘
                       │
                       ▼
              48kHz WAV Output
```

### Why tokenizer-free matters

Traditional TTS pipeline: `audio → VQ-VAE → discrete tokens → LM predicts → de-tokenize → vocoder`

The quantization step (VQ-VAE) crushes subtle acoustic details. A codebook of 1024–8192 entries cannot represent the infinite variation in human speech. You lose:
- Micro-pitch variations within syllables
- Breath texture and timing
- Emotional micro-expressions
- Speaker-specific formant transitions

VoxCPM2 skips all of this. The diffusion process operates in a continuous latent space where these details survive. This is why VoxCPM2's SIM (speaker similarity) scores are so high — it preserves more of what makes a voice *that* voice.
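
To make the quantization argument concrete, here is a minimal sketch that uses a uniform scalar codebook as a stand-in for a VQ-VAE codebook. Real codebooks are learned and vector-valued, so this is only an analogy; `make_codebook` and `quantize` are illustrative names, not part of any TTS library.

```python
import random


def make_codebook(size: int) -> list[float]:
    """Uniform scalar codebook on [0, 1): a toy stand-in for a learned VQ codebook."""
    return [(k + 0.5) / size for k in range(size)]


def quantize(value: float, codebook: list[float]) -> float:
    """Snap a continuous value to its nearest codebook entry."""
    return min(codebook, key=lambda c: abs(c - value))


random.seed(0)
samples = [random.random() for _ in range(1000)]

for size in (16, 1024):
    codebook = make_codebook(size)
    worst = max(abs(x - quantize(x, codebook)) for x in samples)
    # The error shrinks as the codebook grows, but generic inputs never
    # land exactly on an entry, so some detail is always rounded away.
    print(f"codebook={size:5d}  worst-case error={worst:.6f}")
```

The worst-case error is bounded by half the spacing between entries, which is the information a continuous latent space does not have to discard.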

### AudioVAE V2's asymmetric trick

The encoder runs at **16kHz** (cheaper, captures linguistic content) while the decoder outputs at **48kHz** (studio quality). This means:
- Reference audio can be low-quality 16kHz recordings
- Output is always 48kHz regardless
- No external upsampler (like HiFi-GAN) needed
- The super-resolution is learned end-to-end

---

## 3. What Is LoRA and Why We Chose It

### LoRA in 30 seconds

**LoRA (Low-Rank Adaptation)** freezes the entire pretrained model and injects small trainable matrices into specific layers. Instead of updating a weight matrix `W` directly:

```
W' = W + (alpha / r) × B × A
```

Where:
- `W` = original frozen weight (e.g., 4096×4096 = 16M params)
- `B` = trainable matrix (4096×16 = 65K params)
- `A` = trainable matrix (16×4096 = 65K params)
- `r = 16` is the "rank" (how much capacity the adapter has)
- `alpha` is a scaling constant: the low-rank update is multiplied by `alpha / r` (2 in our config)
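
As a rough check on these numbers, the sketch below counts parameters for one adapted layer and applies the low-rank update to a tiny matrix. The helper names (`lora_param_counts`, `matmul`, `merged_weight`) are illustrative and not part of VoxCPM's code, and the `alpha / r` scaling follows the standard LoRA formulation rather than anything verified against the VoxCPM implementation.

```python
def lora_param_counts(d_out: int, d_in: int, r: int) -> tuple[int, int]:
    """Return (full_weight_params, lora_params) for one d_out x d_in layer."""
    full = d_out * d_in
    lora = d_out * r + r * d_in  # B is d_out x r, A is r x d_in
    return full, lora


def matmul(x, y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merged_weight(w, b, a, alpha: float, r: int):
    """Compute W' = W + (alpha / r) * B @ A (standard LoRA scaling, assumed here)."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")

# Toy 2x2 example with rank 1: B is 2x1, A is 1x2.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.5]]
print(merged_weight(w, b, a, alpha=2.0, r=1))
```

With `r = 16` on a 4096×4096 layer the adapter holds 131,072 parameters, about 0.78% of the frozen weight.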

So instead of training 16M parameters per layer, we train only ~131K (under 1% of the layer). The table below shows what this means across the whole model.

### Full SFT vs LoRA comparison

| Aspect | Full SFT | LoRA (our choice) |
|--------|----------|-------------------|
| Trainable params | ~2B (100%) | ~20-50M (~1-2.5%) |
| VRAM required | 60-80 GB | 20-30 GB |
| Training time (same data) | 2-3× longer | Baseline |
| Risk of catastrophic forgetting | High — all languages affected | Low — base model frozen |
| Checkpoint size | ~8 GB | ~100-200 MB |
| Can hot-swap adapters | ❌ | ✅ Load different LoRA for different voices |
| Base model capabilities | May degrade | Fully preserved |
| Inference overhead | None | Negligible (<1% latency) |

### Our specific LoRA configuration

```yaml
lora:
  enable_lm: true    # Adapt the TSLM (language model) — learns Nepali text→speech mapping
  enable_dit: true   # Adapt the LocDiT (diffusion) — learns Nepali acoustic patterns
  r: 16              # Rank 16: sweet spot between capacity and efficiency
  alpha: 32          # Alpha = 2×r: the LoRA update is scaled by alpha/r = 2
  dropout: 0.05      # Light regularization to prevent overfitting on small dataset
```

**Why `r=16`?** VoxCPM2's official config uses r=16. For a 2B model with ~5 hours of Nepali data, this gives enough capacity to learn Nepali phonemes without overfitting. Going higher (r=32, r=64) risks memorizing training data; going lower (r=4, r=8) might not capture all Nepali-specific patterns.

**Why both `enable_lm` and `enable_dit`?** The LM learns text-to-semantic mapping (how Nepali text maps to speech plans), while the DiT learns acoustic generation (how Nepali speech actually sounds). Both need adaptation for a new language.

---

## 4. Datasets — OpenSLR-43 & OpenSLR-143

### Why these datasets?

For a low-resource language like Nepali, publicly available TTS datasets are scarce. We used two complementary datasets:

| Dataset | Source | Speakers | Hours | Format | Content |
|---------|--------|----------|-------|--------|---------|
| **OpenSLR-43** | [openslr.org/43](https://openslr.org/43/) | 1 female | ~3 hrs | WAV + TSV | Read speech, news-style |
| **OpenSLR-143** | [openslr.org/143](https://openslr.org/143/) | 1 male + 1 female | ~2-3 hrs | WAV + TSV | Read speech, general |

**Combined:** ~5-6 hours of clean, read Nepali speech with transcripts — well above VoxCPM2's minimum of 5-10 minutes for LoRA fine-tuning.
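
The hour counts are approximate, so it can be worth measuring them after download. The sketch below sums WAV durations using only the standard library; it assumes uncompressed PCM WAV files and the directory layout shown under "Data structure" below. `wav_duration_seconds` and `total_hours` are hypothetical helper names.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path: Path) -> float:
    """Duration of one PCM WAV file, read from its header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def total_hours(wav_dir: Path) -> float:
    """Sum the durations of every .wav directly inside wav_dir, in hours."""
    return sum(wav_duration_seconds(p) for p in sorted(wav_dir.glob("*.wav"))) / 3600


if __name__ == "__main__":
    # Paths are assumed from the download step in these notes.
    raw = Path.home() / "voxcpm_nepali" / "data" / "raw"
    for name in ("ne_np_female/wavs", "male-female-data"):
        folder = raw / name
        if folder.is_dir():
            print(f"{name}: {total_hours(folder):.2f} h")
```

Reading only the header keeps this fast even for thousands of files, since no audio samples are decoded.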

### Data structure (confirmed by inspection)

**OpenSLR-43:**
```
ne_np_female/
├── line_index.tsv          # Format: filename<TAB>transcript (no header)
└── wavs/
    ├── ne_0001_0037308034.wav
    ├── ne_0001_0340076359.wav
    └── ...
```

**OpenSLR-143:**
```
male-female-data/
├── FemaleVoice.tsv         # Format: audio_id<TAB>sentence (has header row)
├── MaleVoice.tsv           # Format: audio_id<TAB>sentence (has header row)
├── fe_00001234.wav         # WAVs directly in this directory (not in subdirectory)
├── ma_00001234.wav
└── ...
```

---

## 5. Why H100 Instead of L40S

### The original plan

We initially planned to use **2× NVIDIA L40S** GPUs with `torchrun` distributed training. The SLURM script was configured for:
```
#SBATCH --gres=gpu:2
#SBATCH -C gpul40s
```

### What actually happened

L40S nodes were either unavailable or had long queue times. We switched to a **single H100 80GB** node, which turned out to be better for this workload anyway.

### Hardware comparison

| Spec | L40S (×2) | H100 80GB (×1) |
|------|-----------|----------------|
| VRAM | 48 GB × 2 = 96 GB total | 80 GB |
| Memory bandwidth | 864 GB/s × 2 | **3.35 TB/s** |
| FP16 TFLOPS | 362 × 2 = 724 | **989** |
| BF16 TFLOPS | 362 × 2 = 724 | **989** |
| NVLink | ❌ (PCIe only) | N/A (single GPU) |
| Effective throughput | ~60-70% of theoretical (inter-GPU comm overhead) | ~95%+ (no comm) |
| Distributed training overhead | Gradient sync, data parallel | None |

### Why single H100 > 2× L40S for this workload

1. **Memory bandwidth is king for LLM-based models.** VoxCPM2's TSLM is a 2B transformer, and its attention operations are largely memory-bandwidth bound. H100's 3.35 TB/s vs L40S's 864 GB/s is ~3.9× more bandwidth per GPU, so bandwidth-bound kernels can run up to ~4× faster.

2. **No distributed training overhead.** With 2× L40S, `torchrun` uses `DistributedDataParallel` which requires gradient synchronization after every step. Over PCIe (not NVLink), this adds 10-30% overhead for a 2B model.

3. **80GB VRAM is sufficient.** With LoRA, our memory footprint is:
   - Base model (bf16): ~4 GB
   - LoRA params: ~100 MB
   - Optimizer states: ~400 MB
   - Activation memory (batch_size=4): ~15-20 GB
   - **Total: ~25 GB** → fits easily in 80 GB

4. **Simpler debugging.** Single-GPU training eliminates an entire class of distributed training bugs.
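
The memory budget in point 3 can be reproduced with simple arithmetic. The sketch below assumes LoRA weights are stored in bf16 and optimized with AdamW keeping two fp32 moments per parameter; neither detail is confirmed by the training script, and gradients are left out, as in the list above. `lora_memory_gb` is a hypothetical helper.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures above


def lora_memory_gb(base_params: float, lora_params: float,
                   activations_gb: float) -> dict[str, float]:
    """Rough VRAM budget: bf16 weights plus fp32 AdamW state for LoRA params only."""
    budget = {
        "base_model_bf16": base_params * 2 / GB,        # 2 bytes per bf16 weight
        "lora_params_bf16": lora_params * 2 / GB,
        "adamw_states_fp32": lora_params * 2 * 4 / GB,  # two fp32 moments per param
        "activations": activations_gb,
    }
    budget["total"] = sum(budget.values())
    return budget


# 2B base params comes from the comparison table; 50M LoRA params and 20 GB
# of activations are the upper ends of the ranges quoted in these notes.
for name, gb in lora_memory_gb(2e9, 50e6, 20.0).items():
    print(f"{name:>18}: {gb:6.2f} GB")
```

Under these assumptions the total comes to 24.5 GB, in line with the ~25 GB estimate.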

### Updated SLURM config

```bash
#SBATCH -C gpu2h100        # H100 constraint
#SBATCH --gres=gpu:1       # Single GPU
```

And `torchrun --nproc_per_node=1` (or just `python` directly).

We compensated for using 1 GPU instead of 2 by doubling `grad_accum_steps` from 4 to 8, maintaining the same effective batch size: `batch_size(4) × grad_accum(8) = 32 effective`.
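
Why this keeps training equivalent can be seen on a toy problem. The sketch below is not VoxCPM code: it uses a one-parameter model `y = w * x` with a mean-squared-error loss and shows that averaging the gradients of 8 micro-batches of 4 matches the gradient of a single batch of 32. All names are illustrative.

```python
def grad_mse(w: float, batch: list[tuple[float, float]]) -> float:
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: list[tuple[float, float]],
                     micro_batch_size: int) -> float:
    """Average per-micro-batch gradients, as gradient accumulation does."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)


data = [(float(i), 3.0 * i + 1.0) for i in range(32)]  # 32 toy samples
w = 0.5

full = grad_mse(w, data)              # one batch of 32
accum = accumulated_grad(w, data, 4)  # 8 micro-batches of 4
print(f"full-batch grad={full:.6f}  accumulated grad={accum:.6f}")
```

The equality holds because all micro-batches have the same size, so the mean of the micro-batch means is the overall mean.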

---

## 6. Step-by-Step: Environment Setup

```bash
# Load required modules (CWRU Pioneer-specific versions)
module load CUDA/12.1.1
module load Python/3.10.4-GCCcore-11.3.0

# Create virtual environment
python3 -m venv ~/voxcpm_env
source ~/voxcpm_env/bin/activate

# Upgrade pip
pip install --upgrade pip

# Install PyTorch with CUDA 12.1 support
pip install torch==2.1.2+cu121 torchaudio==2.1.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121

# Install dependencies
pip install soundfile pyyaml tensorboard tqdm huggingface_hub safetensors

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}')"
```

### Install VoxCPM from source (not pip)

We discovered that `pip install voxcpm` was not available or didn't work on the cluster. Instead, we cloned the repository and installed from source:

```bash
mkdir -p ~/voxcpm_nepali && cd ~/voxcpm_nepali
git clone https://github.com/OpenBMB/VoxCPM.git voxcpm_repo
pip install -e voxcpm_repo/

# ⚠ CRITICAL: The editable install may uninstall your pinned torch version!
# If you see "Successfully uninstalled torch-2.1.2+cu121", reinstall:
pip install torch==2.1.2+cu121 torchaudio==2.1.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121

# Add local bin to PATH (pip installs scripts there)
export PATH=~/.local/bin:$PATH

# Verify
python -c "from voxcpm import VoxCPM; print('VoxCPM OK')"
```

---

## 7. Step-by-Step: Model & Data Download

### 7a. Download VoxCPM2 pretrained model

```bash
mkdir -p ~/voxcpm_nepali/pretrained_models
cd ~/voxcpm_nepali

python3 << 'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="openbmb/VoxCPM2",
    local_dir="./pretrained_models/VoxCPM2",
    local_dir_use_symlinks=False,
)
print("Done.")
EOF

# Verify (~8 GB of model files)
ls -lh ~/voxcpm_nepali/pretrained_models/VoxCPM2/
```

### 7b. Download Nepali datasets

```bash
mkdir -p ~/voxcpm_nepali/data/raw
cd ~/voxcpm_nepali/data/raw

# OpenSLR-43: Nepali female voice
wget https://www.openslr.org/resources/43/ne_np_female.zip
unzip ne_np_female.zip

# OpenSLR-143: Nepali male + female
wget https://www.openslr.org/resources/143/male-female-data.tgz
tar xzf male-female-data.tgz

# Verify structure (we are already inside data/raw)
ls ne_np_female/wavs/ | head -5
head -3 ne_np_female/line_index.tsv
head -3 male-female-data/FemaleVoice.tsv
head -3 male-female-data/MaleVoice.tsv
```

---

## 8. Step-by-Step: Data Preparation

VoxCPM2's training script expects JSONL manifests where each line is:
```json
{"audio": "/absolute/path/to/file.wav", "text": "transcript text"}
```

All audio must be **16kHz mono WAV** (AudioVAE V2's encoder input rate).
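
Before launching a long job it helps to confirm that a manifest really has this shape. The following is a minimal sketch of such a check, using only the standard library; `validate_manifest` is a hypothetical helper, not something shipped with VoxCPM, and it checks only the two keys shown above.

```python
import json
from pathlib import Path


def validate_manifest(path: str, check_files: bool = True) -> list[str]:
    """Return a list of problems found in a JSONL manifest (empty if clean)."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(entry, dict):
                problems.append(f"line {lineno}: expected a JSON object")
                continue
            audio, text = entry.get("audio"), entry.get("text")
            if not isinstance(audio, str) or not Path(audio).is_absolute():
                problems.append(f"line {lineno}: 'audio' must be an absolute path")
            elif check_files and not Path(audio).is_file():
                problems.append(f"line {lineno}: audio file not found: {audio}")
            if not isinstance(text, str) or not text.strip():
                problems.append(f"line {lineno}: 'text' is missing or empty")
    return problems
```

Calling `validate_manifest("data/manifests/train.jsonl")` after the preparation step should return an empty list; any returned strings point at the offending line.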

### `prepare_manifest.py`

```python
#!/usr/bin/env python3
"""
Prepare JSONL manifests for VoxCPM2 LoRA fine-tuning.
Reads OpenSLR-43 and OpenSLR-143, resamples audio to 16kHz mono,
and creates train/val splits.
"""

import json, os, random, csv
from pathlib import Path
import torchaudio

random.seed(42)

# ── Paths ─────────────────────────────────────────────────────────
HOME = os.path.expanduser("~")
PROJECT = Path(f"{HOME}/voxcpm_nepali")
RAW = PROJECT / "data" / "raw"
OUT_WAV = PROJECT / "data" / "processed" / "wavs_16k"
OUT_MANIFEST = PROJECT / "data" / "manifests"
OUT_WAV.mkdir(parents=True, exist_ok=True)
OUT_MANIFEST.mkdir(parents=True, exist_ok=True)

TARGET_SR = 16000
VAL_RATIO = 0.05  # 5% for validation

entries = []

# ── OpenSLR-43 ────────────────────────────────────────────────────
slr43_dir = RAW / "ne_np_female"
tsv_path = slr43_dir / "line_index.tsv"
with open(tsv_path, encoding="utf-8") as f:  # Devanagari text: do not rely on the locale default
    for line in f:
        line = line.strip()
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        fname, text = parts[0].strip(), parts[1].strip()
        wav_path = slr43_dir / "wavs" / f"{fname}.wav"
        if wav_path.exists():
            entries.append(("slr43", str(wav_path), text))

print(f"OpenSLR-43: {len(entries)} entries")

# ── OpenSLR-143 ───────────────────────────────────────────────────
slr143_dir = RAW / "male-female-data"
for tsv_name in ["FemaleVoice.tsv", "MaleVoice.tsv"]:
    tsv_file = slr143_dir / tsv_name
    if not tsv_file.exists():
        print(f"  WARNING: {tsv_file} not found, skipping")
        continue
    count = 0
    with open(tsv_file, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip header row (no error if the file is empty)
        for row in reader:
            if len(row) < 2:
                continue
            audio_id, text = row[0].strip(), row[1].strip()
            wav_path = slr143_dir / f"{audio_id}.wav"
            if wav_path.exists():
                entries.append(("slr143", str(wav_path), text))
                count += 1
    print(f"  {tsv_name}: {count} entries")

print(f"Total entries: {len(entries)}")

# ── Resample and write ────────────────────────────────────────────
resamplers = {}  # cache one Resample transform per source sample rate

manifest = []
for i, (src, wav_path, text) in enumerate(entries):
    try:
        waveform, sr = torchaudio.load(wav_path)
        # Convert to mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        # Resample if needed, reusing the transform built for this sample rate
        if sr != TARGET_SR:
            if sr not in resamplers:
                resamplers[sr] = torchaudio.transforms.Resample(orig_freq=sr, new_freq=TARGET_SR)
            waveform = resamplers[sr](waveform)
        # Save
        out_name = f"{src}_{i:06d}.wav"
        out_path = OUT_WAV / out_name
        torchaudio.save(str(out_path), waveform, TARGET_SR)
        manifest.append({"audio": str(out_path), "text": text})
    except Exception as e:
        print(f"  ERROR processing {wav_path}: {e}")

    if (i + 1) % 500 == 0:
        print(f"  Processed {i+1}/{len(entries)}")

print(f"Successfully processed: {len(manifest)}")

# ── Train/val split ───────────────────────────────────────────────
random.shuffle(manifest)
val_size = max(1, int(len(manifest) * VAL_RATIO))
val_set = manifest[:val_size]
train_set = manifest[val_size:]

for split, data in [("train", train_set), ("val", val_set)]:
    path = OUT_MANIFEST / f"{split}.jsonl"
    with open(path, "w", encoding="utf-8") as f:
        for entry in data:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    print(f"  {split}.jsonl: {len(data)} entries")

print("Done!")
```

### Run it

```bash
cd ~/voxcpm_nepali
python prepare_manifest.py

# Sanity checks
wc -l data/manifests/train.jsonl data/manifests/val.jsonl
head -3 data/manifests/train.jsonl
python - << 'EOF'
import json
import torchaudio

# Load the first manifest entry and confirm it is 16 kHz mono
with open("data/manifests/train.jsonl", encoding="utf-8") as f:
    first = json.loads(f.readline())
w, sr = torchaudio.load(first["audio"])
print(f"sr={sr}, shape={tuple(w.shape)}, duration={w.shape[1]/sr:.2f}s")
EOF
```

---

## 9. Step-by-Step: YAML Configuration

### `conf/finetune_lora_nepali.yaml`

```yaml
# ── Pretrained model ──────────────────────────────────────────────
pretrained_path: /home/sra42/voxcpm_nepali/pretrained_models/VoxCPM2

# ── Data ──────────────────────────────────────────────────────────
train_manifest: /home/sra42/voxcpm_nepali/data/manifests/train.jsonl
val_manifest:   /home/sra42/voxcpm_nepali/data/manifests/val.jsonl
sample_rate:    16000   # AudioVAE V2 encoder input rate

# ── Training hyperparameters ──────────────────────────────────────
batch_size:      4      # per-GPU batch size
grad_accum_steps: 8     # effective batch = 4 × 8 = 32 (doubled from 4 to compensate for single GPU)
num_workers:     4
num_iters:       5000   # ~5-6 hrs of data × multiple epochs
learning_rate:   2.0e-4 # standard for LoRA

# ── LoRA configuration ────────────────────────────────────────────
lora:
  enable_lm:  true      # adapt TSLM (language model) for Nepali text → speech mapping
  enable_dit: true      # adapt LocDiT (diffusion transformer) for Nepali acoustics
  r:          16        # low-rank dimension
  alpha:      32        # scaling factor (alpha/r = 2.0)
  dropout:    0.05      # regularization

# ── Checkpointing & logging ──────────────────────────────────────
save_dir:     /home/sra42/voxcpm_nepali/checkpoints/lora_nepali
save_interval: 500      # save every 500 steps
log_dir:      /home/sra42/voxcpm_nepali/logs/lora_nepali
```

### Parameter reasoning

| Parameter | Value | Why |
|-----------|-------|-----|
| `batch_size: 4` | 4 | Fits in H100 80GB with LoRA. Larger batches average out noise. |
| `grad_accum_steps: 8` | 8 | Effective batch 32. Compensates for single GPU (was 4 with 2× L40S). |
| `num_iters: 5000` | 5000 | With ~5hrs of data and batch_size=4, this covers the dataset ~10-15 times. |
| `learning_rate: 2e-4` | 2e-4 | Standard for LoRA (10× higher than full SFT's typical 2e-5). |
| `save_interval: 500` | 500 | Gives us 10 checkpoints to pick the best one. |
| `sample_rate: 16000` | 16000 | AudioVAE V2 encoder expects 16kHz input. Output is still 48kHz. |
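
The dataset-coverage estimate depends on the number of training utterances and on whether `num_iters` counts micro-batches or optimizer steps, which these notes do not state. The sketch below shows the arithmetic for both readings; the utterance count is a placeholder, and `epochs_covered` is a hypothetical helper.

```python
def epochs_covered(num_iters: int, batch_size: int, grad_accum_steps: int,
                   n_utterances: int, iter_is_optimizer_step: bool) -> float:
    """How many passes over the training set a run of num_iters makes."""
    samples_per_iter = batch_size * (grad_accum_steps if iter_is_optimizer_step else 1)
    return num_iters * samples_per_iter / n_utterances


# n_utterances is a placeholder: use `wc -l data/manifests/train.jsonl` instead.
n_utterances = 4000
for is_opt_step in (False, True):
    e = epochs_covered(5000, 4, 8, n_utterances, is_opt_step)
    print(f"iter_is_optimizer_step={is_opt_step}: ~{e:.1f} epochs")
```

Plugging in the real line count from `train.jsonl` gives the actual coverage for this run.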

### Common mistakes to avoid

- ❌ Setting `sample_rate: 48000` — the encoder input must be 16kHz
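- ❌ Using relative paths (they resolve against the job's working directory, which by default is wherever `sbatch` was run, so the config breaks when submitted from anywhere else)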
- ❌ Setting `grad_accum_steps: 1` with small batch — effective batch too small, unstable training
- ❌ Forgetting `enable_dit: true` — the model can "understand" Nepali text but produce wrong acoustics
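
These checks are easy to automate. The sketch below takes an already-parsed config dictionary (for example, the result of `yaml.safe_load` on the file above) and reports the mistakes from this list. `check_config` is a hypothetical helper, and the minimum effective batch of 16 is an arbitrary illustrative threshold, not a VoxCPM recommendation.

```python
from pathlib import PurePosixPath

PATH_KEYS = ("pretrained_path", "train_manifest", "val_manifest", "save_dir", "log_dir")


def check_config(cfg: dict, min_effective_batch: int = 16) -> list[str]:
    """Return warnings for the common mistakes listed above (empty list if none)."""
    problems = []
    if cfg.get("sample_rate") != 16000:
        problems.append("sample_rate must be 16000 (encoder input rate)")
    for key in PATH_KEYS:
        value = cfg.get(key)
        if not isinstance(value, str) or not PurePosixPath(value).is_absolute():
            problems.append(f"{key} should be an absolute path")
    effective = cfg.get("batch_size", 1) * cfg.get("grad_accum_steps", 1)
    if effective < min_effective_batch:  # threshold is an illustrative assumption
        problems.append(f"effective batch {effective} is below {min_effective_batch}")
    if not cfg.get("lora", {}).get("enable_dit", False):
        problems.append("lora.enable_dit is off: acoustics will not be adapted")
    return problems


example = {
    "sample_rate": 48000,
    "pretrained_path": "/models/VoxCPM2",
    "train_manifest": "data/train.jsonl",  # relative on purpose
    "val_manifest": "/data/val.jsonl",
    "save_dir": "/ckpt",
    "log_dir": "/logs",
    "batch_size": 4,
    "grad_accum_steps": 1,
    "lora": {"enable_lm": True, "enable_dit": False},
}
for problem in check_config(example):
    print("-", problem)
```

Running it on the deliberately broken `example` prints one line per mistake; a clean config prints nothing.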

---

## 10. Step-by-Step: SLURM Script

### `submit_lora.slurm`

```bash
#!/bin/bash
#SBATCH --job-name=voxcpm_lora_nepali
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --mem=64gb
#SBATCH -p gpu
#SBATCH -C gpu2h100
#SBATCH --gres=gpu:1
#SBATCH -t 12:00:00
#SBATCH -A mxh605
#SBATCH -o logs/slurm_%j.out
#SBATCH -e logs/slurm_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=sra42@case.edu

# ── Module & environment ─────────────────────────────────────────
module load CUDA/12.1.1
module load Python/3.10.4-GCCcore-11.3.0
source ~/voxcpm_env/bin/activate
export PATH=~/.local/bin:$PATH

# ── Paths ─────────────────────────────────────────────────────────
TRAIN_SCRIPT=~/voxcpm_nepali/voxcpm_repo/scripts/train_voxcpm_finetune.py
CONFIG=~/voxcpm_nepali/conf/finetune_lora_nepali.yaml

# ── Launch training ───────────────────────────────────────────────
cd ~/voxcpm_nepali
mkdir -p logs

torchrun \
    --nproc_per_node=1 \
    --master_port=29500 \
    "$TRAIN_SCRIPT" \
    --config_path "$CONFIG"
```

### SLURM directives explained

| Directive | Value | Why |
|-----------|-------|-----|
| `-N 1` | 1 node | Single-node training |
| `-n 1` | 1 task | One training process |
| `-c 16` | 16 CPUs | Data-loading workers (`num_workers: 4`) plus headroom |
| `--mem=64gb` | 64 GB RAM | Dataset loading + preprocessing buffer |
| `-p gpu` | gpu partition | GPU-enabled nodes |
| `-C gpu2h100` | H100 constraint | Specifically request H100 nodes |
| `--gres=gpu:1` | 1 GPU | Single H100 |
| `-t 12:00:00` | 12 hours | 5000 iterations with buffer |
| `-A mxh605` | PI account | Allocation/billing account |

---

## 11. Step-by-Step: Submit & Monitor Training

```bash
# Submit the job
cd ~/voxcpm_nepali
mkdir -p logs   # SLURM does not create the -o/-e directory; it must exist before submission
sbatch submit_lora.slurm

# Check queue status
squeue -u sra42

# Watch training logs in real-time
tail -f logs/slurm_<JOBID>.out

# Monitor GPU utilization (from another terminal)
ssh sra42@pioneer.case.edu
srun --jobid=<JOBID> --pty nvidia-smi -l 5

# TensorBoard (SSH tunnel)
# Terminal 1 (on cluster):
tensorboard --logdir="$HOME/voxcpm_nepali/logs/lora_nepali" --port=6006 --bind_all  # $HOME: bash does not expand ~ after --logdir=
# Terminal 2 (local machine):
ssh -L 6006:localhost:6006 sra42@pioneer.case.edu
# Then open http://localhost:6006 in browser
```

### What to watch in training logs

- **Loss decreasing** — should drop steadily over first 1000 steps, then plateau
- **GPU memory** — should be ~25-30 GB for LoRA (if 70+ GB, something is wrong)
- **Iteration speed** — expect ~2-5 seconds per step on H100
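
The expected step time also tells us whether the 12-hour wall-time limit is safe. A quick back-of-the-envelope sketch, using only the numbers quoted in these notes (`wall_time_hours` is an illustrative helper and ignores checkpointing and startup time):

```python
def wall_time_hours(num_iters: int, seconds_per_step: float) -> float:
    """Total training time in hours, ignoring checkpointing and startup."""
    return num_iters * seconds_per_step / 3600


for sps in (2.0, 5.0):
    hours = wall_time_hours(5000, sps)
    print(f"{sps:.0f} s/step -> {hours:.1f} h (limit: 12 h)")
```

If the observed step time is well above 5 seconds, the job may hit the `-t 12:00:00` limit before the final checkpoint is written.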

---

## 12. Troubleshooting Log

These are actual bugs we encountered and fixed:

### Bug 1: `python: command not found`

**Symptom:** Running `python "$TRAIN_SCRIPT" --help` gave "command not found"  
**Cause:** Virtual environment not activated, or `python` binary not in PATH  
**Fix:** 
```bash
source ~/voxcpm_env/bin/activate
export PATH=~/.local/bin:$PATH
# Or use python3 instead of python
```

### Bug 2: `pip show -f voxcpm` → "Package not found"

**Symptom:** `voxcpm` module couldn't be imported even though we believed it was installed  
**Cause:** `pip install voxcpm` may not have been available or failed silently  
**Fix:** Install from cloned repository:
```bash
git clone https://github.com/OpenBMB/VoxCPM.git ~/voxcpm_nepali/voxcpm_repo
pip install -e ~/voxcpm_nepali/voxcpm_repo/
```

### Bug 3: `pip install -e voxcpm_repo/` uninstalled our pinned PyTorch

**Symptom:** Warning about torch being uninstalled: "Successfully uninstalled torch-2.1.2+cu121"  
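**Cause:** VoxCPM's `pyproject.toml` specifies `torch>=2.5.0` as a dependency. pip resolved this by uninstalling our CUDA-specific torch and installing a newer CPU-only or different CUDA version.  
**Fix:** Reinstall the correct torch immediately after. Note that 2.1.2 is below the declared `torch>=2.5.0` minimum, so pip will report a dependency conflict for `voxcpm` after this step: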
```bash
pip install torch==2.1.2+cu121 torchaudio==2.1.2+cu121 \
    --index-url https://download.pytorch.org/whl/cu121
```

### Bug 4: Scripts installed to `~/.local/bin` not on PATH

**Symptom:** `WARNING: The scripts ... are installed in '/home/sra42/.local/bin' which is not on PATH`  
**Cause:** pip installs entry-point scripts to `~/.local/bin` which isn't in HPC default PATH  
**Fix:** Add to SLURM script and `.bashrc`:
```bash
export PATH=~/.local/bin:$PATH
```

### Bug 5: `--config` vs `--config_path` flag uncertainty

**Symptom:** Not sure which CLI flag the training script accepts  
**Cause:** VoxCPM's training script uses `--config_path` but we initially guessed `--config`  
**Fix:** Check with `python train_voxcpm_finetune.py --help` and use the correct flag

### Bug 6: L40S nodes unavailable

**Symptom:** Jobs stuck in queue indefinitely with `-C gpul40s --gres=gpu:2`  
**Cause:** L40S nodes were heavily utilized or under maintenance  
**Fix:** Switched to `-C gpu2h100 --gres=gpu:1` with `grad_accum_steps: 8` to maintain effective batch size

---

## 13. Testing — Single Model Inference

### `test_inference.py`

```python
#!/usr/bin/env python3
"""Quick test: generate Nepali speech with the fine-tuned LoRA model."""

import torch
import soundfile as sf
from pathlib import Path
from safetensors.torch import load_file
from voxcpm.model import VoxCPM2Model
from voxcpm.model.voxcpm2 import LoRAConfig

PRETRAINED = "/home/sra42/voxcpm_nepali/pretrained_models/VoxCPM2"
LORA_CKPT  = "/home/sra42/voxcpm_nepali/checkpoints/lora_nepali/latest/lora_weights.safetensors"
OUTPUT_DIR = Path("/home/sra42/voxcpm_nepali/test_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Load model with LoRA
lora_cfg = LoRAConfig(enable_lm=True, enable_dit=True, r=16, alpha=32, dropout=0.05)
model = VoxCPM2Model.from_local(PRETRAINED, optimize=False, training=False, lora_config=lora_cfg)
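lora_state = load_file(LORA_CKPT)
# strict=False leaves the base weights untouched, but it also hides key
# mismatches, so report any LoRA keys that did not land in the model.
result = model.load_state_dict(lora_state, strict=False)
if result.unexpected_keys:
    print(f"WARNING: {len(result.unexpected_keys)} LoRA keys were not loaded")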
model.eval()
model = model.to("cuda")

SAMPLE_RATE = getattr(model, "sample_rate", None) or 48000  # fall back to 48 kHz if the attribute is absent

test_texts = [
    "नमस्ते, मेरो नाम VoxCPM हो।",
    "नेपाल एक सुन्दर देश हो।",
    "आज मौसम धेरै राम्रो छ।",
]

for i, text in enumerate(test_texts):
    print(f"Generating [{i+1}]: {text}")
    with torch.no_grad():
        audio = model.generate(
            target_text=text,
            inference_timesteps=50,
            cfg_value=3.5,
        )
    audio_np = audio.cpu().float().numpy().flatten()
    sf.write(str(OUTPUT_DIR / f"sample_{i+1}.wav"), audio_np, SAMPLE_RATE)
    print(f"  Saved: sample_{i+1}.wav ({len(audio_np)/SAMPLE_RATE:.2f}s)")

print("Done!")
```

### Run

```bash
srun -p gpu -C gpu2h100 --gres=gpu:1 -c 4 --mem=32gb -A mxh605 --pty bash
source ~/voxcpm_env/bin/activate && module load CUDA/12.1.1 Python/3.10.4-GCCcore-11.3.0
cd ~/voxcpm_nepali && python test_inference.py
```

---

## 14. Testing — Original vs Fine-tuned Comparison

### `test_compare.py`

```python
#!/usr/bin/env python3
"""
Compare original VoxCPM2 vs LoRA fine-tuned model on Nepali TTS.
Outputs paired WAV files for side-by-side listening.
"""

import torch
import soundfile as sf
import numpy as np
from pathlib import Path
from safetensors.torch import load_file
from voxcpm.model import VoxCPM2Model
from voxcpm.model.voxcpm2 import LoRAConfig

# ── Config ────────────────────────────────────────────────────────
PRETRAINED  = "/home/sra42/voxcpm_nepali/pretrained_models/VoxCPM2"
LORA_CKPT   = "/home/sra42/voxcpm_nepali/checkpoints/lora_nepali/latest/lora_weights.safetensors"
OUTPUT_DIR  = Path("/home/sra42/voxcpm_nepali/test_outputs/compare")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

INFERENCE_TIMESTEPS = 50
CFG_VALUE           = 3.5
SAMPLE_RATE         = 48000

# ── Test sentences ────────────────────────────────────────────────
TEST_TEXTS = [
    "नमस्ते।",
    "आज मौसम धेरै राम्रो छ, हामी बाहिर घुम्न जाऔं।",
    "नेपाल एक सुन्दर देश हो जहाँ हिमालय, पहाड र तराई क्षेत्र छन्।",
    "सरकारले नयाँ शिक्षा नीति लागू गर्ने घोषणा गरेको छ, जसले विद्यार्थीहरूको भविष्य उज्यालो बनाउने अपेक्षा गरिएको छ।",
    "एक समयको कुरा हो, एउटा सानो गाउँमा एक जना बुद्धिमान वृद्ध मानिस बस्थे। उनले आफ्नो जीवनभर धेरै कठिनाइहरू झेलेका थिए, तर कहिल्यै हार मानेनन्।",
    "नेपालको क्षेत्रफल एक लाख सत्तरी हजार वर्ग किलोमिटर छ र जनसंख्या लगभग तीन करोड छ।",
    "तपाईंको नाम के हो? तपाईं कहाँबाट आउनुभयो? के तपाईंलाई नेपाली खाना मनपर्छ?",
    "आमाको माया संसारमा सबैभन्दा ठूलो माया हो। उनको आँचलमा सुत्दा संसारका सबै दुःख बिर्सिन्छन्।",
]

# ── Helpers ────────────────────────────────────────────────────────
def normalize(audio):
    peak = np.abs(audio).max()
    return audio / peak * 0.9 if peak > 0 else audio

def generate(model, text, label):
    print(f"    [{label}] generating...", end=" ", flush=True)
    try:
        with torch.no_grad():
            audio = model.generate(target_text=text, inference_timesteps=INFERENCE_TIMESTEPS, cfg_value=CFG_VALUE)
        if audio is None:
            print("FAILED"); return None
        audio_np = normalize(audio.cpu().float().numpy().flatten())
        print(f"OK ({len(audio_np)/SAMPLE_RATE:.2f}s)")
        return audio_np
    except Exception as e:
        print(f"ERROR: {e}"); return None

# ── Load original model ──────────────────────────────────────────
print("Loading ORIGINAL VoxCPM2...")
orig_model = VoxCPM2Model.from_local(PRETRAINED, optimize=False, training=False)
orig_model.eval().to("cuda")

# ── Load fine-tuned model ─────────────────────────────────────────
print("Loading FINE-TUNED (LoRA) VoxCPM2...")
lora_cfg = LoRAConfig(enable_lm=True, enable_dit=True, r=16, alpha=32, dropout=0.05)
ft_model = VoxCPM2Model.from_local(PRETRAINED, optimize=False, training=False, lora_config=lora_cfg)
lora_state = load_file(LORA_CKPT)
ft_model.load_state_dict(lora_state, strict=False)
ft_model.eval().to("cuda")

# ── Run comparison ────────────────────────────────────────────────
for i, text in enumerate(TEST_TEXTS):
    print(f"\n[{i+1}/{len(TEST_TEXTS)}] {text[:60]}...")
    orig_audio = generate(orig_model, text, "original")
    ft_audio   = generate(ft_model,   text, "finetuned")

    # 1 s of silence separates the two clips in the combined file
    silence = np.zeros(SAMPLE_RATE, dtype=np.float32)

    if orig_audio is not None:
        sf.write(str(OUTPUT_DIR / f"{i+1:02d}_original.wav"), orig_audio, SAMPLE_RATE)
    if ft_audio is not None:
        sf.write(str(OUTPUT_DIR / f"{i+1:02d}_finetuned.wav"), ft_audio, SAMPLE_RATE)
    if orig_audio is not None and ft_audio is not None:
        combined = np.concatenate([orig_audio, silence, ft_audio])
        sf.write(str(OUTPUT_DIR / f"{i+1:02d}_COMPARE.wav"), combined, SAMPLE_RATE)

print(f"\n✓ All outputs in: {OUTPUT_DIR}")
```
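
Because `load_state_dict(..., strict=False)` silently skips names that do not match, a mislabeled checkpoint can appear to load while leaving the model identical to the base. PyTorch returns the mismatches from that call as `missing_keys` and `unexpected_keys`. The sketch below shows one way to turn them into a quick check. The helper name `check_lora_load`, the example key names, and the assumption that adapter tensors contain `lora_` in their names are ours, not part of the VoxCPM API.

```python
def check_lora_load(ckpt_keys, unexpected_keys, marker="lora_"):
    """Return (n_loaded, problems) for a strict=False LoRA load.

    ckpt_keys:       tensor names found in the checkpoint file
    unexpected_keys: names the model rejected (from load_state_dict's result)
    marker:          substring assumed to identify adapter tensors (assumption)
    """
    rejected = set(unexpected_keys)
    loaded = [k for k in ckpt_keys if k not in rejected]
    problems = []
    if not loaded:
        problems.append("no checkpoint tensor matched the model")
    if rejected:
        problems.append(f"{len(rejected)} checkpoint tensors were ignored")
    if loaded and not any(marker in k for k in loaded):
        problems.append(f"no loaded tensor name contains '{marker}'")
    return len(loaded), problems


# Example with made-up key names. In test_compare.py you would pass
# list(lora_state) and result.unexpected_keys, where
# result = ft_model.load_state_dict(lora_state, strict=False).
n, problems = check_lora_load(
    ["lm.layers.0.q_proj.lora_A", "lm.layers.0.q_proj.lora_B", "typo.weight"],
    ["typo.weight"],
)
print(n, problems)
```

An empty `problems` list is a reasonable precondition before trusting any original-vs-fine-tuned comparison.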

### Run and retrieve results

```bash
srun -p gpu -C gpu2h100 --gres=gpu:1 -c 4 --mem=32gb -A mxh605 --pty bash
source ~/voxcpm_env/bin/activate && module load CUDA/12.1.1 Python/3.10.4-GCCcore-11.3.0
cd ~/voxcpm_nepali && python test_compare.py

# Copy to local machine
scp -r sra42@pioneer.case.edu:~/voxcpm_nepali/test_outputs/compare/ ./nepali_tts_compare/
```

### What to listen for

| Aspect | Original (base) | Fine-tuned (LoRA), expected |
|--------|-----------------|-----------------------------|
| Pronunciation | May use Hindi-like phonemes | Clearer Nepali phonemes (retroflex ट/ठ/ड/ढ, aspirated stops) |
| Rhythm | Generic South Asian prosody | More natural Nepali cadence |
| Intonation | May miss sentence-final patterns | Better Nepali sentence-final falling intonation |
| Fluency | Possible pauses/hesitation on Nepali text | Smoother, more confident delivery |
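
Listening remains the main test, but a few numbers help catch obvious failures (empty output, runaway generation, near-silent audio) before you spend time on it. The sketch below reads a WAV with the standard library and reports duration, peak, and RMS. It assumes 16-bit PCM files, which we believe is what `sf.write` produces for float input by default; `wav_stats` is a hypothetical helper, not part of the project scripts.

```python
import math
import os
import struct
import tempfile
import wave


def wav_stats(path):
    """Return (duration_s, peak, rms) for a 16-bit PCM WAV; peak and rms are in [0, 1]."""
    with wave.open(str(path), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        n_frames = wf.getnframes()
        rate = wf.getframerate()
        raw = wf.readframes(n_frames)
    count = len(raw) // 2                      # number of int16 samples, all channels
    if count == 0:
        return 0.0, 0.0, 0.0
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    peak = max(abs(s) for s in samples) / 32768.0
    rms = math.sqrt(sum(s * s for s in samples) / count) / 32768.0
    return n_frames / rate, peak, rms


# Self-check on a synthetic half-scale square wave (1 s, mono, 8 kHz).
tmp = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(tmp, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(8000)
    wf.writeframes(struct.pack("<8000h", *([16384, -16384] * 4000)))

duration, peak, rms = wav_stats(tmp)
print(f"{duration:.2f}s peak={peak:.2f} rms={rms:.2f}")
```

Since `test_compare.py` peak-normalizes both outputs to 0.9, the peak value will be nearly identical for each pair; duration (pace, truncation) and RMS are the more informative columns when comparing `XX_original.wav` with `XX_finetuned.wav`.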

---

## 15. Voice Cloning with Fine-tuned Model

VoxCPM2 supports **3 levels of voice cloning**, and all work with our LoRA adapter loaded on top. The fine-tuning improves Nepali pronunciation while the cloning preserves the target speaker's voice.

### How voice cloning works in VoxCPM2

The RALM (Residual Acoustic Language Model) conditions generation on a reference audio clip. It picks up speaker identity (timbre, pitch range, speaking rate) from the reference and merges it with the text content plan from the TSLM. The result: the model speaks your text in the reference speaker's voice.

### `test_voice_clone.py`

```python
#!/usr/bin/env python3
"""
Voice cloning with the fine-tuned VoxCPM2 model.
Demonstrates all 3 cloning modes: basic, controllable, and ultimate.
"""

import torch
import soundfile as sf
import numpy as np
from pathlib import Path
from voxcpm import VoxCPM

# ── Config ────────────────────────────────────────────────────────
MODEL_PATH     = "/home/sra42/voxcpm_nepali/pretrained_models/VoxCPM2"
LORA_CKPT_DIR  = "/home/sra42/voxcpm_nepali/checkpoints/lora_nepali/latest"
REFERENCE_WAV  = "/home/sra42/voxcpm_nepali/data/processed/wavs_16k/slr43_000000.wav"  # pick any training sample
REFERENCE_TEXT = "यो एउटा परीक्षण वाक्य हो।"  # placeholder: replace with the exact transcript of REFERENCE_WAV (used by ultimate cloning)
OUTPUT_DIR     = Path("/home/sra42/voxcpm_nepali/test_outputs/cloning")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Load model with LoRA ─────────────────────────────────────────
print("Loading VoxCPM2 with Nepali LoRA adapter...")
model = VoxCPM.from_pretrained(MODEL_PATH, load_denoiser=False)
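# NOTE: the adapter load below is commented out, so as written this script
# runs the BASE model. Check the VoxCPM API for the exact LoRA-loading call,
# then enable it before judging fine-tuned cloning quality.
# model.load_lora(LORA_CKPT_DIR)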

SAMPLE_RATE = model.tts_model.sample_rate  # 48000

# ── Test texts ────────────────────────────────────────────────────
NEPALI_TEXTS = [
    "नमस्ते, म तपाईंको आवाजमा बोलिरहेको छु।",
    "नेपाल एक सुन्दर देश हो जहाँ हिमालय छन्।",
    "आज मौसम राम्रो छ, बाहिर जाऔं।",
]

# ═══════════════════════════════════════════════════════════════════
# MODE 1: Basic Voice Cloning
# ─────────────────────────────────────────────────────────────────
# Just provide reference audio. Model clones the timbre automatically.
# ═══════════════════════════════════════════════════════════════════
print("\n📎 Mode 1: Basic Voice Cloning")
for i, text in enumerate(NEPALI_TEXTS):
    print(f"  [{i+1}] {text[:50]}...")
    wav = model.generate(
        text=text,
        reference_wav_path=REFERENCE_WAV,
    )
    sf.write(str(OUTPUT_DIR / f"basic_clone_{i+1}.wav"), wav, SAMPLE_RATE)
    print(f"      Saved: basic_clone_{i+1}.wav")

# ═══════════════════════════════════════════════════════════════════
# MODE 2: Controllable Voice Cloning
# ─────────────────────────────────────────────────────────────────
# Clone the voice BUT add style control instructions.
# The model preserves the speaker's timbre while adjusting
# speed, emotion, tone as instructed.
# ═══════════════════════════════════════════════════════════════════
print("\n🎛️ Mode 2: Controllable Voice Cloning")

style_controls = [
    ("slightly faster, cheerful tone", "तपाईंलाई भेटेर खुसी लाग्यो!"),
    ("slow and calm, whispering", "रातको शान्त समयमा हिमालयको सुन्दरता अलग नै हुन्छ।"),
    ("energetic, excited voice", "नेपालले विश्वकप जित्यो! यो अविश्वसनीय छ!"),
]

for i, (style, text) in enumerate(style_controls):
    styled_text = f"({style}){text}"
    print(f"  [{i+1}] Style: {style}")
    print(f"       Text: {text[:50]}...")
    wav = model.generate(
        text=styled_text,
        reference_wav_path=REFERENCE_WAV,
        cfg_value=2.0,
        inference_timesteps=10,
    )
    sf.write(str(OUTPUT_DIR / f"controllable_clone_{i+1}.wav"), wav, SAMPLE_RATE)
    print(f"      Saved: controllable_clone_{i+1}.wav")

# ═══════════════════════════════════════════════════════════════════
# MODE 3: Ultimate Voice Cloning
# ─────────────────────────────────────────────────────────────────
# Provide reference audio + its exact transcript.
# The model uses audio-continuation: it treats the reference as
# the beginning of the utterance and continues from there.
# This reproduces every vocal nuance — timbre, rhythm, emotion.
# For maximum similarity, pass the same clip to both
# reference_wav_path and prompt_wav_path.
# ═══════════════════════════════════════════════════════════════════
print("\n🎙️ Mode 3: Ultimate Voice Cloning")
for i, text in enumerate(NEPALI_TEXTS):
    print(f"  [{i+1}] {text[:50]}...")
    wav = model.generate(
        text=text,
        prompt_wav_path=REFERENCE_WAV,
        prompt_text=REFERENCE_TEXT,
        reference_wav_path=REFERENCE_WAV,  # same clip for max similarity
    )
    sf.write(str(OUTPUT_DIR / f"ultimate_clone_{i+1}.wav"), wav, SAMPLE_RATE)
    print(f"      Saved: ultimate_clone_{i+1}.wav")

print(f"\n✓ All cloning outputs in: {OUTPUT_DIR}")
print("\nFile guide:")
print("  basic_clone_*.wav         → timbre cloned, default style")
print("  controllable_clone_*.wav  → timbre cloned + style instructions applied")
print("  ultimate_clone_*.wav      → highest fidelity, audio-continuation mode")
```

### Run

```bash
srun -p gpu -C gpu2h100 --gres=gpu:1 -c 4 --mem=32gb -A mxh605 --pty bash
source ~/voxcpm_env/bin/activate && module load CUDA/12.1.1 Python/3.10.4-GCCcore-11.3.0
cd ~/voxcpm_nepali && python test_voice_clone.py
```

### How the 3 cloning modes differ

```
┌─────────────────────────────────────────────────────────────────┐
│                     BASIC CLONING                               │
│  Input: reference_wav_path + text                               │
│  What happens:                                                  │
│    1. AudioVAE encodes reference → speaker embedding            │
│    2. TSLM plans speech from text                               │
│    3. RALM merges speaker identity + speech plan                │
│    4. LocDiT generates audio in cloned voice                    │
│  Result: Same voice, neutral style                              │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                  CONTROLLABLE CLONING                            │
│  Input: reference_wav_path + "(style instructions)text"         │
│  What happens:                                                  │
│    Same as basic, but style instructions modify the TSLM's      │
│    speech plan. The model adjusts prosody/emotion/speed while   │
│    keeping the cloned timbre from the reference.                │
│  Result: Same voice + controlled style                          │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    ULTIMATE CLONING                              │
│  Input: prompt_wav_path + prompt_text + reference_wav_path      │
│  What happens:                                                  │
│    1. Model ingests (audio, transcript) as a continuation prompt│
│    2. Treats the reference as the START of the utterance        │
│    3. Generates new text as a seamless CONTINUATION             │
│    4. Reproduces every nuance: timbre, rhythm, emotion, style   │
│  Result: Highest fidelity of the three modes                    │
└─────────────────────────────────────────────────────────────────┘
```
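
The three modes differ only in which arguments reach `model.generate()`, so the diagram can be summarized as a small dispatch function. This is a sketch built from the calls used in `test_voice_clone.py`; the helper name `clone_kwargs` and its `mode` strings are hypothetical, and the argument names are the ones used in that script rather than a verified API reference.

```python
def clone_kwargs(mode, text, ref_wav, style=None, ref_transcript=None):
    """Build keyword arguments for model.generate() for one cloning mode."""
    if mode == "basic":
        # Reference audio only: timbre is cloned, style is left to the model.
        return {"text": text, "reference_wav_path": ref_wav}
    if mode == "controllable":
        # Same as basic, with a parenthesized style instruction prefixed to the text.
        if not style:
            raise ValueError("controllable cloning needs a style instruction")
        return {"text": f"({style}){text}", "reference_wav_path": ref_wav}
    if mode == "ultimate":
        # Audio continuation: the clip and its transcript act as the prompt.
        if not ref_transcript:
            raise ValueError("ultimate cloning needs the reference transcript")
        return {
            "text": text,
            "prompt_wav_path": ref_wav,
            "prompt_text": ref_transcript,
            "reference_wav_path": ref_wav,  # same clip, as in the script above
        }
    raise ValueError(f"unknown mode: {mode!r}")


kwargs = clone_kwargs("controllable", "नमस्ते", "ref.wav", style="slow and calm")
print(kwargs)
# wav = model.generate(**kwargs)   # with `model` loaded as in test_voice_clone.py
```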

### LoRA's role in cloning

Our Nepali LoRA adapter should improve cloning for Nepali text because:
- The TSLM adapter learned Nepali text → speech planning, which targets pronunciation errors
- The LocDiT adapter learned Nepali acoustic patterns, so generated audio should sound more natural
- The cloning mechanism (RALM) is intended to be separate from what LoRA modifies, so speaker identity transfer should be largely unaffected; verify this by ear, since `enable_lm=True` may also adapt layers the RALM relies on
- **Net effect:** Clone a voice, and the cloned voice should speak better Nepali
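
Whether the adapter really leaves the cloning path alone depends on which modules carry LoRA tensors, and that can be read off the checkpoint's key names. The sketch below counts adapter tensors per top-level module. The key names are made up for illustration, and the `lora_` marker is an assumption about how the adapter tensors are named; real names would come from the saved `lora_weights.safetensors`.

```python
from collections import Counter


def lora_tensors_by_component(keys, marker="lora_"):
    """Count adapter tensors per top-level module name (text before the first dot)."""
    return Counter(k.split(".", 1)[0] for k in keys if marker in k)


# Made-up key names for illustration. With safetensors installed, real names
# are available as load_file(LORA_CKPT).keys().
example_keys = [
    "base_lm.layers.0.self_attn.q_proj.lora_A",
    "base_lm.layers.0.self_attn.q_proj.lora_B",
    "feat_decoder.blocks.0.attn.qkv.lora_A",
    "feat_encoder.proj.weight",
]
counts = lora_tensors_by_component(example_keys)
print(dict(counts))
```

If a component involved in reference conditioning shows up in the counts, speaker similarity is worth re-checking against the base model with the same reference clip.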

---

## 16. Voice Design with Fine-tuned Model

**Voice Design** is VoxCPM2's unique feature: create a brand-new voice from a **text description alone** — no reference audio needed. You describe the voice (gender, age, tone, emotion, pace) and the model generates a matching voice.

### How it works

The voice description goes into the TSLM as a conditioning prefix. The parenthesized description `(A young woman, gentle and sweet voice)` is parsed by the model and used to shape the speaker characteristics of the generated audio. The model was trained on large-scale (audio, description) pairs so it learned to map natural language descriptions to speaker embeddings internally.
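
A small helper keeps the `(description)text` prefix format consistent across scripts. The sketch below is hypothetical (`build_design_prompt` is not part of VoxCPM), and the rule against parentheses inside the description is our own precaution, on the assumption that nested parentheses could blur where the prefix ends.

```python
def build_design_prompt(description, text):
    """Return '(description)text', the prefix format used in this section."""
    description = " ".join(description.split())  # collapse stray whitespace/newlines
    if not description:
        raise ValueError("empty voice description")
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could confuse where the prefix ends.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text.strip()}"


prompt = build_design_prompt("A young woman,  gentle and sweet voice", "नमस्ते।")
print(prompt)
```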

### `test_voice_design.py`

```python
#!/usr/bin/env python3
"""
Voice Design: Create new voices from text descriptions.
Uses the fine-tuned LoRA model for correct Nepali pronunciation.
"""

import torch
import soundfile as sf
from pathlib import Path
from voxcpm import VoxCPM

# ── Config ────────────────────────────────────────────────────────
MODEL_PATH     = "/home/sra42/voxcpm_nepali/pretrained_models/VoxCPM2"
LORA_CKPT_DIR  = "/home/sra42/voxcpm_nepali/checkpoints/lora_nepali/latest"
OUTPUT_DIR     = Path("/home/sra42/voxcpm_nepali/test_outputs/voice_design")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# ── Load model with LoRA ─────────────────────────────────────────
print("Loading VoxCPM2 with Nepali LoRA adapter...")
model = VoxCPM.from_pretrained(MODEL_PATH, load_denoiser=False)
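# NOTE: the adapter load below is commented out, so as written this script
# runs the BASE model. Check the VoxCPM API for the exact LoRA-loading call.
# model.load_lora(LORA_CKPT_DIR)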

SAMPLE_RATE = model.tts_model.sample_rate  # 48000

# ── Voice designs ─────────────────────────────────────────────────
# Format: (voice description)Text to speak
# The description in parentheses controls WHO speaks.
# The text after controls WHAT they say.

designs = [
    {
        "name": "young_woman_gentle",
        "description": "A young Nepali woman, gentle and sweet voice, speaking softly",
        "text": "नमस्ते, म तपाईंलाई नेपालको बारेमा बताउन चाहन्छु।",
    },
    {
        "name": "old_man_wise",
        "description": "An elderly Nepali man, deep gravelly voice, speaking slowly and wisely",
        "text": "जीवनमा सबैभन्दा महत्त्वपूर्ण कुरा धैर्य हो, बाबु।",
    },
    {
        "name": "child_excited",
        "description": "A Nepali child, about 8 years old, excited and energetic voice",
        "text": "आमा! हेर्नुस्, हिमालय कति सुन्दर छ! म पहाडमा जान चाहन्छु!",
    },
    {
        "name": "news_anchor_formal",
        "description": "A professional Nepali news anchor, clear and authoritative male voice, formal tone",
        "text": "आजको मुख्य समाचारमा, सरकारले नयाँ विकास योजना घोषणा गरेको छ।",
    },
    {
        "name": "young_man_casual",
        "description": "A young Nepali man in his 20s, casual and friendly voice, slightly fast",
        "text": "भाइ, आज खेल हेर्न जाने हो? मैदानमा ठूलो खेल हुँदैछ!",
    },
    {
        "name": "woman_storyteller",
        "description": "A middle-aged Nepali woman, warm and expressive voice, storytelling tone",
        "text": "एक समयको कुरा हो, हिमालयको कुशमा एउटा सानो गाउँ थियो। त्यहाँ एक जना साहसी केटी बस्थिन्।",
    },
    {
        "name": "teacher_patient",
        "description": "A Nepali female teacher, patient and clear voice, speaking at medium pace, enunciating carefully",
        "text": "अब हामी नेपालको इतिहासको बारेमा पढ्नेछौं। ध्यान दिएर सुन्नुहोस्।",
    },
]

# ── Generate ──────────────────────────────────────────────────────
print(f"\nGenerating {len(designs)} voice designs...\n")

for design in designs:
    name = design["name"]
    desc = design["description"]
    text = design["text"]

    full_text = f"({desc}){text}"

    print(f"🎨 {name}")
    print(f"   Voice: {desc}")
    print(f"   Text:  {text[:60]}...")

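    # Retry up to 3 times and keep the first successful generation.
    # (The official advice to generate 1-3 times is about picking the best
    # take by ear; this loop only guards against failed generations.)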
    best_wav = None
    for attempt in range(3):
        wav = model.generate(
            text=full_text,
            cfg_value=2.0,
            inference_timesteps=10,
        )
        if wav is not None:
            best_wav = wav
            break  # take first successful generation

    if best_wav is not None:
        sf.write(str(OUTPUT_DIR / f"{name}.wav"), best_wav, SAMPLE_RATE)
        print(f"   ✓ Saved: {name}.wav ({len(best_wav)/SAMPLE_RATE:.2f}s)")
    else:
        print(f"   ✗ Failed after 3 attempts")
    print()

# ── Also generate same text with different voices for comparison ──
print("=" * 60)
print("Generating same sentence in different voices for comparison...")
SAME_TEXT = "नेपाल एक सुन्दर देश हो।"

voice_variants = [
    ("A young woman, gentle voice", "female_gentle"),
    ("An old man, deep and slow voice", "male_deep"),
    ("A child, cheerful and fast", "child_cheerful"),
    ("A news anchor, formal and clear", "anchor_formal"),
]

for desc, label in voice_variants:
    wav = model.generate(
        text=f"({desc}){SAME_TEXT}",
        cfg_value=2.0,
        inference_timesteps=10,
    )
    if wav is not None:
        sf.write(str(OUTPUT_DIR / f"same_text_{label}.wav"), wav, SAMPLE_RATE)
        print(f"  ✓ {label}: saved")

print(f"\n✓ All voice design outputs in: {OUTPUT_DIR}")
```

### Run

```bash
srun -p gpu -C gpu2h100 --gres=gpu:1 -c 4 --mem=32gb -A mxh605 --pty bash
source ~/voxcpm_env/bin/activate && module load CUDA/12.1.1 Python/3.10.4-GCCcore-11.3.0
cd ~/voxcpm_nepali && python test_voice_design.py
```

### Voice Design tips

1. **Be specific:** "A young woman" → "A young Nepali woman in her 20s, gentle voice with a slight smile"
2. **Mention pace:** "speaking slowly" or "slightly fast" — pace control works well
3. **Emotion matters:** "warm", "cheerful", "serious", "whispering" all produce different outputs
4. **Generate 1-3 times:** Results vary between runs (official recommendation)
5. **`cfg_value=2.0`** is recommended for voice design (higher values = stronger adherence to description)
6. **`inference_timesteps=10`** is the sweet spot for speed vs quality
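
Tip 4 is easier to follow if every take is kept on disk rather than only the first success. The sketch below saves up to `n` takes of one prompt under numbered file names. `save_candidates` is a hypothetical helper; `generate_fn` and `write_fn` are injected so the example runs without a GPU, and in practice they would wrap `model.generate` and `sf.write`.

```python
import tempfile
from pathlib import Path


def save_candidates(generate_fn, write_fn, prompt, out_dir, name, n=3):
    """Generate up to n takes of one prompt and save each under its own file name.

    generate_fn(prompt) -> audio or None     (e.g. a wrapper around model.generate)
    write_fn(path, audio) -> None            (e.g. a wrapper around sf.write)
    Returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for take in range(1, n + 1):
        audio = generate_fn(prompt)
        if audio is None:          # skip failed takes instead of aborting
            continue
        path = out_dir / f"{name}_take{take}.wav"
        write_fn(path, audio)
        written.append(path)
    return written


# Stand-in generator: the second take "fails" to show that it is skipped.
fake_outputs = iter([b"a", None, b"c"])
paths = save_candidates(
    generate_fn=lambda prompt: next(fake_outputs),
    write_fn=lambda path, audio: path.write_bytes(audio),
    prompt="(A young woman, gentle voice)नेपाल एक सुन्दर देश हो।",
    out_dir=tempfile.mkdtemp(),
    name="female_gentle",
)
print([p.name for p in paths])
```

With the real model, `generate_fn` could be `lambda p: model.generate(text=p, cfg_value=2.0, inference_timesteps=10)` and `write_fn` could be `lambda path, audio: sf.write(str(path), audio, SAMPLE_RATE)`.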

### Important caveat

Voice Design quality is **description-dependent**. Since VoxCPM2 was trained primarily on English/Chinese descriptions, describing voices as "Nepali-sounding" may not always produce the expected result. Our LoRA adapter helps with pronunciation but doesn't change the voice design conditioning mechanism.

---

## 17. Web Demo (Gradio)

VoxCPM2 includes a built-in Gradio web demo at [`app.py`](https://github.com/OpenBMB/VoxCPM/blob/main/app.py). This gives you a browser-based UI for TTS, Voice Design, and Voice Cloning (basic and ultimate).

### Running on the cluster

```bash
# Get an interactive GPU session
srun -p gpu -C gpu2h100 --gres=gpu:1 -c 4 --mem=32gb -A mxh605 --pty bash

# Set up environment
source ~/voxcpm_env/bin/activate
export PATH=~/.local/bin:$PATH
module load CUDA/12.1.1 Python/3.10.4-GCCcore-11.3.0

# Install gradio if not already installed
pip install gradio

# Launch the web demo
cd ~/voxcpm_nepali/voxcpm_repo
python app.py --port 8808
```

### Accessing from your local machine

Since the HPC nodes aren't directly accessible via browser, use SSH tunneling:

```bash
# Terminal on your LOCAL machine:
# First, find which compute node the job is running on
ssh sra42@pioneer.case.edu "squeue -u sra42 -o '%N'"
# Example output: gpu042

# Create SSH tunnel through login node to compute node
ssh -L 8808:gpu042:8808 sra42@pioneer.case.edu

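# Now open in browser:
# http://localhost:8808
# If the page does not load, the demo may be listening only on 127.0.0.1 on the
# compute node; check how app.py sets Gradio's server_name (0.0.0.0 accepts
# connections forwarded from the login node).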
```
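
The node lookup and the tunnel command can be combined so the node name is not copied by hand. The sketch below parses the text printed by `squeue -u sra42 -o '%N'` and builds the matching `ssh -L` command. `tunnel_command` is a hypothetical helper; it assumes a single-node job and does not expand bracketed node ranges.

```python
def tunnel_command(squeue_output, user, login_host, port=8808):
    """Build the `ssh -L` command from the output of `squeue -u USER -o '%N'`."""
    nodes = [
        line.strip()
        for line in squeue_output.splitlines()
        if line.strip() and line.strip() != "NODELIST"   # drop header and blanks
    ]
    if not nodes:
        # Pending jobs have an empty node list, so nothing is left to tunnel to.
        raise ValueError("no running job found in squeue output")
    node = nodes[0]            # assumption: the first listed job hosts the demo
    return f"ssh -L {port}:{node}:{port} {user}@{login_host}"


cmd = tunnel_command("NODELIST\ngpu042\n", "sra42", "pioneer.case.edu")
print(cmd)
```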

### WebUI tabs

| Tab | Function | What you do |
|-----|----------|-------------|
| **Text-to-Speech** | Basic TTS | Type text → generate speech |
| **Voice Design** | Create voice from description | Type voice description + text → generate |
| **Voice Cloning** | Clone from reference audio | Upload reference WAV + type text → generate in that voice |
| **Ultimate Cloning** | Highest fidelity | Upload reference WAV + transcript + text → generate |

### LoRA Fine-tuning WebUI

VoxCPM2 also provides a separate fine-tuning WebUI:

```bash
python lora_ft_webui.py  # opens on http://localhost:7860
```

This provides a graphical interface for:
- Uploading training data
- Configuring LoRA parameters
- Starting/stopping training
- Monitoring loss curves
- Testing inference with trained LoRA

> Note: We used the command-line SLURM approach for our training since it's more appropriate for HPC environments, but the WebUI is useful for quick experiments on a local GPU.

---

## 18. Project Directory Structure

```
~/voxcpm_nepali/
├── conf/
│   └── finetune_lora_nepali.yaml       ← Training config
├── checkpoints/
│   └── lora_nepali/
│       ├── latest/
│       │   └── lora_weights.safetensors ← Best/latest LoRA checkpoint (~100-200MB)
│       └── step_*/                      ← Intermediate checkpoints
├── data/
│   ├── raw/
│   │   ├── ne_np_female/               ← OpenSLR-43 (female voice)
│   │   │   ├── line_index.tsv
│   │   │   └── wavs/
│   │   └── male-female-data/           ← OpenSLR-143 (male + female)
│   │       ├── FemaleVoice.tsv
│   │       ├── MaleVoice.tsv
│   │       └── *.wav
│   ├── processed/
│   │   └── wavs_16k/                   ← Resampled to 16kHz mono
│   └── manifests/
│       ├── train.jsonl                  ← Training manifest
│       └── val.jsonl                    ← Validation manifest
├── logs/
│   ├── lora_nepali/                     ← TensorBoard logs
│   └── slurm_*.out / slurm_*.err       ← SLURM job logs
├── pretrained_models/
│   └── VoxCPM2/                         ← Downloaded from HuggingFace (~8 GB)
├── test_outputs/
│   ├── sample_*.wav                     ← Single model test
│   ├── compare/                         ← A/B comparison (original vs fine-tuned)
│   │   ├── XX_original.wav
│   │   ├── XX_finetuned.wav
│   │   └── XX_COMPARE.wav
│   ├── cloning/                         ← Voice cloning outputs
│   │   ├── basic_clone_*.wav
│   │   ├── controllable_clone_*.wav
│   │   └── ultimate_clone_*.wav
│   └── voice_design/                    ← Voice design outputs
│       ├── young_woman_gentle.wav
│       ├── old_man_wise.wav
│       ├── child_excited.wav
│       └── same_text_*.wav
├── voxcpm_repo/                         ← Cloned GitHub repo
│   ├── scripts/train_voxcpm_finetune.py
│   ├── app.py                           ← Gradio web demo
│   └── lora_ft_webui.py                 ← Fine-tuning WebUI
├── prepare_manifest.py
├── test_inference.py
├── test_compare.py
├── test_voice_clone.py
├── test_voice_design.py
└── submit_lora.slurm
```
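
Before submitting a job or running the test scripts, it can save a queue wait to confirm that the key files from the tree above are in place. The sketch below lists whichever expected paths are missing. `missing_paths` and the `EXPECTED` selection are our own choices, taken from the layout shown here.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED = [
    "conf/finetune_lora_nepali.yaml",
    "checkpoints/lora_nepali/latest/lora_weights.safetensors",
    "data/manifests/train.jsonl",
    "data/manifests/val.jsonl",
    "pretrained_models/VoxCPM2",
    "voxcpm_repo/scripts/train_voxcpm_finetune.py",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root).expanduser()
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("~/voxcpm_nepali"):
        print(f"missing: {rel}")
```

The checkpoint entry will be reported as missing until the first training run has written `latest/`, which is expected.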

---

## 19. References & Citation

### VoxCPM2

- **Repository:** [github.com/OpenBMB/VoxCPM](https://github.com/OpenBMB/VoxCPM)
- **Model Weights:** [huggingface.co/openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2)
- **Documentation:** [voxcpm.readthedocs.io](https://voxcpm.readthedocs.io/en/latest/models/voxcpm2.html)
- **Project Page:** [voxcpm.net](https://voxcpm.net/)
- **ComfyUI Integration:** [Saganaki22/ComfyUI-VoxCPM2](https://github.com/Saganaki22/ComfyUI-VoxCPM2)

```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation,
             Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation
             and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```

### LoRA

```bibtex
@article{hu2021lora,
  title   = {LoRA: Low-Rank Adaptation of Large Language Models},
  author  = {Hu, Edward J and Shen, Yelong and Wallis, Phillip and
             Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and
             Wang, Lu and Chen, Weizhu},
  journal = {arXiv preprint arXiv:2106.09685},
  year    = {2021},
}
```

### Datasets

- **OpenSLR-43:** [openslr.org/43](https://openslr.org/43/) — Nepali female read speech
- **OpenSLR-143:** [openslr.org/143](https://openslr.org/143/) — Nepali male + female read speech

### Infrastructure

- **CWRU Pioneer HPC:** [ondemand-pioneer.case.edu](https://ondemand-pioneer.case.edu/public/sinfo_pioneer.html)

### Further Reading

- [Medium: VoxCPM2 vs ElevenLabs benchmark analysis](https://medium.com/@tentenco/voxcpm2-the-open-source-voice-model-that-beats-elevenlabs-on-similarity-but-the-full-benchmark-ffe408b50b87)

---

*Generated: April 2026 | CWRU HPC Pioneer Cluster | VoxCPM2 LoRA Fine-tuning for Nepali TTS*