Listen — Nepali voice samples (LoRA diagnostics)

Nepali LoRA Listening Lab

An interactive exploration of a Low-Rank Adaptation (LoRA) fine-tune for Nepali speech. Use this lab bench to inspect the speech frequencies and adapter weights of the fine-tuned voice model.

Every voice clip on this page is generated. The only human-recorded audio is the 3-second reference clip in the cloning tab, which provides the speaker's physical timbre.

How LoRA works

A pretrained voice model stores its knowledge in large weight matrices $W_0$. Retraining all of them for Nepali would be expensive and risk forgetting what the model already knows. LoRA leaves $W_0$ frozen and learns a small correction $\Delta W = BA$ through a narrow rank-$r$ channel.

$W = W_0 + \frac{\alpha}{r} \, BA$
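The update rule can be sketched in NumPy as follows. This is a minimal illustration, not the lab's training code: it assumes a single square layer with $d = 1024$, rank $r = 8$, and $\alpha = 16$ (values consistent with the parameter budget on this page), and it follows the common LoRA convention of initializing $B$ to zero so the adapter starts as a no-op. The name `lora_forward` is illustrative.

```python
import numpy as np

d, r, alpha = 1024, 8, 16.0  # assumed sizes; alpha chosen so that alpha / r = 2.0

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, d))         # frozen pretrained weights
A = rng.standard_normal((r, d)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init: adapter starts as a no-op

def lora_forward(x, W0, A, B, alpha, r):
    """Apply W0 plus the scaled low-rank update without forming the d x d product BA."""
    base = x @ W0.T
    update = (x @ A.T) @ B.T             # passes through the narrow rank-r channel
    return base + (alpha / r) * update

x = rng.standard_normal((2, d))          # two example input vectors
y = lora_forward(x, W0, A, B, alpha, r)
print(y.shape, A.size + B.size)
```

Only `A` and `B` would receive gradients during fine-tuning; `W0` is read but never modified.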

Three listening tests

Compare: the same Nepali sentence before and after the LoRA adapter.

Clone demo: a reference voice reading new Nepali lines with adapted pronunciation.

Voice design: new voices generated from short English persona descriptions.

Three listening tests, one tab each: Compare (base vs LoRA on the same sentence), Clone · demo (an early reference speaker), and Voice design (no reference — persona from an English line). Jump tabs anytime; audio in hidden tabs is paused so only what you see is playing.

1 · Same sentence, before and after

Each row plays the same Nepali sentence twice. Original is the base text-to-speech model on its own. Fine-tuned is the same model with the Nepali LoRA adapter loaded on top. Words highlight as they play.

01 — greeting

नमस्ते।

Before — base model

After — with Nepali LoRA

02 — weather & movement

आज मौसम धेरै राम्रो छ, हामी बाहिर घुम्न जाऔं।

Before — base model

After — with Nepali LoRA

03 — geography & regions

नेपाल एक सुन्दर देश हो जहाँ हिमालय, पहाड र तराई क्षेत्र छन्।

Before — base model

After — with Nepali LoRA

04 — education policy

सरकारले नयाँ शिक्षा नीति लागू गर्ने घोषणा गरेको छ, जसले विद्यार्थीहरूको भविष्य उज्यालो बनाउने अपेक्षा गरिएको छ।

Before — base model

After — with Nepali LoRA

05 — wise old man

एक समयको कुरा हो, एउटा सानो गाउँमा एक जना बुद्धिमान वृद्ध मानिस बस्थे। उनले आफ्नो जीवनभर धेरै कठिनाइहरू झेलेका थिए, तर कहिल्यै हार मानेनन्।

Before — base model

After — with Nepali LoRA

06 — area & stats

नेपालको क्षेत्रफल एक लाख सत्तरी हजार वर्ग किलोमिटर छ र जनसंख्या लगभग तीन करोड छ।

Before — base model

After — with Nepali LoRA

07 — inquiries

तपाईंको नाम के हो? तपाईं कहाँबाट आउनुभयो? के तपाईंलाई नेपाली खाना मनपर्छ?

Before — base model

After — with Nepali LoRA

08 — mother's love

आमाको माया संसारमा सबैभन्दा ठूलो माया हो। उनको आँचलमा सुत्दा संसारका सबै दुःख बिर्सिन्छन्।

Before — base model

After — with Nepali LoRA

2 · A cloned voice, now speaking Nepali

This experiment isolates **identity (reference clip)** from **skill (the LoRA)**. The reference is a short English recording of the speaker. The model copies that timbre and uses the LoRA weights to pronounce new Nepali text with natural phonology.

Reference speaker

A 3-second recording of the target voice. The same clip anchors the synthesized speaker identity for every sample below.

Reference clip (real human voice)

01 — sentence in the cloned voice

नमस्ते, मेरो नाम सारा हो।

In the cloned voice

02 — sentence in the cloned voice

आज मौसम धेरै राम्रो छ।

In the cloned voice

03 — sentence in the cloned voice

नेपाल एक सुन्दर देश हो जहाँ हिमालय, पहाड र तराई क्षेत्र छन्।

In the cloned voice

04 — sentence in the cloned voice

सरकारले नयाँ शिक्षा नीति लागू गर्ने घोषणा गरेको छ।

In the cloned voice

05 — sentence in the cloned voice

आमाको माया संसारमा सबैभन्दा ठूलो माया हो।

In the cloned voice

3 · A voice designed from a text description

This test invents brand-new identities entirely from a single-line English description (like "warm storyteller" or "elderly deep man"). No reference audio is used.

Voice Manifold Diagnostics

The Gravitational Pull of Dataset Bias

Acoustic Drift Phenomenon: Rows 02, 03, and 08 show the same failure. The prompts asked for male voices, yet the model synthesized higher-pitched, lighter timbres. This behaves like an attractor state: the female-skewed fine-tuning data acts as an acoustic gravity well, pulling male prompts toward feminine formants.

Acoustic Vector Interference

1. Desired Vector: "Deep, elderly, calm man" (low fundamental frequency, $F_0 \approx 100\,\text{Hz}$).

2. Manifold Skew: the fine-tuning dataset is dominated by high-pitched voices.

3. Final Orbit: the speaker vector collapses toward the high-density region.
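One way to check whether a designed voice landed near the requested pitch is to estimate $F_0$ from the generated audio. The sketch below is a minimal autocorrelation estimator run on synthetic tones; it is not the lab's own diagnostic, the function name and the 60 to 400 Hz search range are assumptions, and real speech would first need voiced frames selected.

```python
import numpy as np

def estimate_f0(signal, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()
    # Autocorrelation at non-negative lags only.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    # Search only lags that correspond to a plausible speech pitch.
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    return sample_rate / best_lag

sample_rate = 16_000
t = np.arange(int(0.1 * sample_rate)) / sample_rate
low_tone = np.sin(2 * np.pi * 100.0 * t)    # stand-in for a deep male voice
high_tone = np.sin(2 * np.pi * 220.0 * t)   # stand-in for a higher-pitched voice

print(estimate_f0(low_tone, sample_rate), estimate_f0(high_tone, sample_rate))
```

The two estimates should land close to 100 Hz and 220 Hz. Applied to a clip generated from a "deep, elderly man" prompt, an estimate far above 100 Hz would be consistent with the drift described here.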

Fine-Tuning Dataset Balance

65/35 Female/Male Split

Female representation: ~65%
Male representation: ~35%

Voice Data Breakdown (Hours of Audio)

Dataset	File	Contents
OpenSLR-43	ne_np_female	female only
OpenSLR-143	FemaleVoice.tsv	~2.5 hours
OpenSLR-143	MaleVoice.tsv	~2.5 hours
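As a rough check on these figures, the sketch below back-solves the totals from the male share. It assumes the ~35% male share is measured in hours of audio and that the ~2.5 hours in MaleVoice.tsv is all of the male data; neither is stated explicitly on this page, and `hours_to_balance` is an illustrative helper.

```python
def hours_to_balance(male_hours, male_share):
    """Return (total_hours, female_hours, extra_male_hours_needed_for_50_50)."""
    total_hours = male_hours / male_share
    female_hours = total_hours - male_hours
    return total_hours, female_hours, female_hours - male_hours

# Assumed inputs: ~2.5 h of male audio making up ~35% of the corpus.
total, female, extra_male = hours_to_balance(male_hours=2.5, male_share=0.35)
print(f"total ~{total:.1f} h, female ~{female:.1f} h, extra male needed ~{extra_male:.1f} h")
```

Under these assumptions the corpus holds roughly 7 hours in total, and about 2 more hours of male audio would bring the split to 50/50.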

Row 02

Prompt not followed

Deep, authoritative, calm elderly man


Row 03

Prompt not followed

Energetic, enthusiastic young man


Row 08

Prompt not followed

Measured, authoritative male teacher


Next Round Mitigation Strategies

Balance Splitting: Acquire more high-quality male audio to offset the female-only OpenSLR-43 data.

Explicit Tags: Prepend explicit speaker tokens (e.g. [speaker:male_deep]) to the transcripts in the training manifests.

LoRA Expansion: Increase the LoRA rank $r$ from 8 to 16 or 32 to add capacity for multi-speaker variation.

Manifold Penalty: Add a speaker-regularization penalty during training that discourages updates to gender-identifying layers, effectively freezing them.
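The explicit-tag idea could look like the sketch below. The manifest layout (one `utterance_id<TAB>transcript` pair per line) and the function name are assumptions for illustration; only the `[speaker:male_deep]` token format comes from the strategy above.

```python
def tag_manifest_line(line, speaker_tag):
    """Prepend a speaker token to the transcript column of one TSV manifest line."""
    # Hypothetical layout: utterance id, a tab, then the transcript.
    utterance_id, transcript = line.rstrip("\n").split("\t", 1)
    return f"{utterance_id}\t[speaker:{speaker_tag}] {transcript}"

tagged = tag_manifest_line("utt_001\tआज मौसम धेरै राम्रो छ।", "male_deep")
print(tagged)
```

The same token would then be prepended at inference time when a male voice is requested, so the model can associate it with the tagged training speakers.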

01 young_female_gentle

(A young woman with a gentle, sweet, and warm voice, speaking slowly and clearly)

नमस्ते, मेरो नाम सीता हो। म नेपालबाट बोल्दै छु।

Designed voice

02 old_male_deep

Issue: prompt not followed

(An elderly man with a deep, authoritative, and calm voice)

नेपाल एक सुन्दर देश हो जहाँ हिमालय, पहाड र तराई क्षेत्र छन्।

Designed voice

03 young_male_energetic

Issue: prompt not followed

(A young man with an energetic, enthusiastic, and fast-paced voice)

आज मौसम धेरै राम्रो छ, हामी बाहिर घुम्न जाऔं!

Designed voice

04 female_newsreader

(A professional female news anchor, clear articulation, neutral tone, moderate pace)

Designed voice

05 female_storyteller

(A warm, motherly woman telling a bedtime story, soft and soothing voice)

एक समयको कुरा हो, एउटा सानो गाउँमा एक जना बुद्धिमान वृद्ध मानिस बस्थे।

Designed voice

06 male_cheerful

(A cheerful middle-aged man, slightly smiling, friendly and inviting tone)

तपाईंलाई नेपाली खाना मनपर्छ? आउनुहोस्, हामीसँग खाना खानुहोस्!

Designed voice

07 female_sad_emotional

(A young woman, sad and emotional, speaking slowly with a trembling voice)

Designed voice

08 male_formal_teacher

Issue: prompt not followed

(A male teacher, calm and measured, explaining clearly with authority)

Designed voice

Parameter budget

Quantity	Value
Base weights ($d^2$)	1,048,576
LoRA params ($2dr$)	16,384
Compression ratio	64× fewer trained parameters
Effective scale	$\alpha / r = 2.0$
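These figures can be reproduced with a few lines of arithmetic. The sketch assumes a single square $d \times d$ layer with $d = 1024$ (since $1024^2 = 1{,}048{,}576$), rank $r = 8$, and $\alpha = 16$ (implied by $\alpha / r = 2.0$). `lora_budget` is an illustrative helper, and holding $\alpha$ fixed while the rank grows is an assumption; in practice $\alpha$ is often rescaled along with $r$.

```python
def lora_budget(d, r, alpha):
    """Parameter counts for one d x d layer adapted with rank-r LoRA."""
    base_params = d * d        # frozen weights in W0
    lora_params = 2 * d * r    # entries of B (d x r) plus A (r x d)
    return {
        "base": base_params,
        "lora": lora_params,
        "ratio": base_params / lora_params,
        "scale": alpha / r,
    }

# Rank 8 matches the figures above; 16 and 32 are the proposed expansions.
for rank in (8, 16, 32):
    print(rank, lora_budget(d=1024, r=rank, alpha=16))
```

Doubling the rank doubles the trained parameters and halves the compression ratio, so rank 32 would still train 16× fewer parameters than the full layer.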
