What is a tokenizer? — Nepali benchmark & reference

Tokenizers in scope

The GPT-family encodings are byte-level BPE, loaded here through tiktoken. Their merges were learned from English-heavy web text, so Nepali written in Devanagari usually splits into many more, much smaller pieces than it does under the Indic-pretrained tokenizers.
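One way to see where the overhead comes from is to count raw UTF-8 bytes. Each code point in the Devanagari block takes three bytes, so before any learned merges apply, a byte-level tokenizer starts from three base tokens per code point, against one per ASCII letter. The sketch below uses only the standard library; the helper name `utf8_cost` and the sample words are illustrative and are not part of the benchmark.

```python
def utf8_cost(text: str) -> dict:
    """Return code-point and UTF-8 byte counts for a string.

    A byte-level BPE tokenizer starts from the byte sequence, so the byte
    count is the worst-case token count before any learned merges apply.
    """
    n_codepoints = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {
        "codepoints": n_codepoints,
        "bytes": n_bytes,
        "bytes_per_codepoint": n_bytes / n_codepoints if n_codepoints else 0.0,
    }


# Illustrative inputs (not rows from the benchmark corpus).
nepali = utf8_cost("नेपाल")   # Devanagari: 3 bytes per code point
english = utf8_cost("Nepal")  # ASCII: 1 byte per code point

print(nepali)
print(english)
```

How far a real encoding gets below this ceiling depends on how many Devanagari byte sequences made it into its merge table.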

GPT family

OpenAI · tiktoken

GPT-2

import tiktoken
enc = tiktoken.get_encoding("gpt2")

Mechanism: Byte-level BPE with a vocabulary of 50,257 tokens (256 byte tokens, 50,000 learned merges, and one end-of-text token). Historic default for GPT-2–style models.
Nepali relevance: No Indic-specific pretraining; useful as a baseline for “how many tokens does generic English-centric BPE spend on Nepali?”
Docs: openai/tiktoken

OpenAI · tiktoken

GPT-4 / GPT-3.5 family (cl100k_base)

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

Mechanism: Byte-level BPE used by GPT-3.5, GPT-4, and many GPT-4-era tools. Its vocabulary is roughly twice the size of gpt2's (about 100,000 tokens) and covers non-Latin scripts better in practice, but it is still not a Nepali-specific vocabulary. GPT-4o uses a newer encoding (o200k_base), so these results describe cl100k_base, not GPT-4o.
Nepali relevance: Often fewer tokens than gpt2 on non-Latin text, but still not designed around Devanagari morphology.
Docs: openai/tiktoken (encoding names)

Indic checkpoints differ in whether Nepali was in the pretraining mix. MuRIL explicitly lists Nepali (ne) among its supported languages. The public IndicBERT and IndicBART cards do not list Nepali as a primary supported language or language tag in the same way, but their tokenizers still encode Devanagari strings, which makes them useful Indic subword comparisons.

Indic family (Hugging Face)

AI4Bharat

IndicBERT

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

Model: Multilingual ALBERT-style encoder pretrained on about 9B tokens across 12 Indian languages (see model card).
Tokenizer: SentencePiece-style subword tokenizer loaded through Hugging Face AutoTokenizer; the visible ▁ marks word starts. Check the model revision and tokenizer files before comparing runs.
Card: huggingface.co/ai4bharat/indic-bert
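To make the ▁ convention concrete, the following sketch groups SentencePiece-style pieces back into words and rebuilds the text. The piece list is hand-written for illustration and is not real IndicBERT output; `group_pieces` and `detokenize` are hypothetical helper names.

```python
WORD_START = "\u2581"  # the visible "▁" marker that opens a new word


def group_pieces(pieces: list[str]) -> list[list[str]]:
    """Group subword pieces into words, starting a new word at each marker."""
    words: list[list[str]] = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


def detokenize(pieces: list[str]) -> str:
    """Join pieces and turn word-start markers back into spaces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


# Hand-written pieces for illustration only (not real tokenizer output).
pieces = [WORD_START + "तिमी", WORD_START + "भात", WORD_START + "खा", "न्छौ"]
words = group_pieces(pieces)

print(detokenize(pieces))
print([len(w) for w in words])  # pieces per word
```

Counting pieces per word this way is one route to a per-word fertility figure when a tokenizer exposes its pieces rather than only IDs.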

Google

MuRIL

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

Model: BERT-base architecture; 17 Indian languages including Nepali; uses translation and transliteration pairs in pretraining.
Tokenizer: Cased WordPiece (same broad family as multilingual BERT).
Card: huggingface.co/google/muril-base-cased
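WordPiece tokenizers in the BERT family segment each whitespace-separated word greedily: they take the longest vocabulary entry that matches at the current position and mark non-initial pieces with `##`. A minimal sketch of that matching rule follows. The vocabulary is a toy one invented for this example, not MuRIL's, and real implementations also normalize text and split punctuation first.

```python
def wordpiece_word(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation of a single word."""
    pieces: list[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # BERT-style: the whole word becomes one unknown token
        pieces.append(match)
        start = end
    return pieces


def wordpiece(text: str, vocab: set[str]) -> list[str]:
    """Split on whitespace, then segment each word."""
    return [p for word in text.split() for p in wordpiece_word(word, vocab)]


# Toy vocabulary, invented for illustration.
toy_vocab = {"तिमी", "भात", "खा", "##न्छौ", "##न्छ"}

print(wordpiece("तिमी भात खान्छौ", toy_vocab))
print(wordpiece("तिमी पानी", toy_vocab))  # second word is out of vocabulary
```

The second call shows why the unknown-token rate is worth tracking: a word with no matching pieces collapses to a single `[UNK]`, which lowers the token count while losing the content.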

AI4Bharat

IndicBART

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART",
    do_lower_case=False,
    use_fast=False,
    keep_accents=True,
)

Model: Smaller mBART-style seq2seq model for Indic NLG; IndicCorp-scale pretraining.
Tokenizer: SentencePiece-backed vocabulary. The model card notes SentencePiece rather than classic mBART BPE, and older examples often load it through AlbertTokenizer. Language tags such as <2hi> are part of the input format used at training time.
Card: huggingface.co/ai4bharat/IndicBART
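As recalled from the IndicBART model card, the training-time layout puts the source text first, followed by `</s>` and a language tag, while the target text is preceded by its language tag and closed with `</s>`. The helpers below only build those strings. The function names are hypothetical, and both the layout and the list of valid tags should be checked against the card before relying on them.

```python
def format_source(sentence: str, lang: str) -> str:
    """Encoder-side layout: 'sentence </s> <2xx>'."""
    return f"{sentence.strip()} </s> <2{lang}>"


def format_target(sentence: str, lang: str) -> str:
    """Decoder-side layout: '<2yy> sentence </s>'."""
    return f"<2{lang}> {sentence.strip()} </s>"


src = format_source("I am a boy", "en")
tgt = format_target("मैं एक लड़का हूँ", "hi")

print(src)
print(tgt)
```

Because the tags and `</s>` are written into the string, the card's examples pass `add_special_tokens=False` when tokenizing. For a pure token-count benchmark the tags add a fixed overhead per sentence, so decide once whether to include them and record that choice.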

Nepali-specific (project)

Local / custom

Nepali WordPiece

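# Your pipeline, e.g. a BERT-style WordPiece tokenizer built from your own vocab file (placeholder path)
from transformers import BertTokenizerFast
tok = BertTokenizerFast(vocab_file="path/to/vocab.txt", do_lower_case=False)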

Role: WordPiece over a Nepali (or mixed Nepali–English) vocabulary file — typical for BERT-style TTS or LM front-ends you train yourself.
What to record in the benchmark: vocab path, cased or uncased, and the unknown-token rate on the corpus rows below.

Local / custom

Nepali SentencePiece

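# Your pipeline, e.g. a SentencePiece model loaded from your own .model file (placeholder path)
import sentencepiece as spm
tok = spm.SentencePieceProcessor(model_file="path/to/nepali.model")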

Role: Unigram or BPE SentencePiece model trained on Nepali-dominated text — common for seq2seq TTS, ASR, or compact lexicons.
What to record in the benchmark: model path (.model), vocabulary size, and whether Unicode normalization (NFC or NFD) is applied before encoding.
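Normalization is worth recording because the same visible text can be stored as different code-point sequences, and a subword model sees those as different inputs. The sketch below uses the standard-library `unicodedata` module to compare the two forms. The helper names are illustrative, and the sample character was chosen only because it has a canonical decomposition; how often such sequences occur in a given Nepali corpus has to be measured.

```python
import unicodedata


def normalization_report(text: str) -> dict:
    """Compare code-point lengths of a string under NFC and NFD."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return {
        "raw_len": len(text),
        "nfc_len": len(nfc),
        "nfd_len": len(nfd),
        "changed_by_nfc": nfc != text,
        "changed_by_nfd": nfd != text,
    }


def prepare(text: str, form: str = "NFC") -> str:
    """Apply one fixed normalization form before encoding."""
    return unicodedata.normalize(form, text)


# U+0929 (DEVANAGARI LETTER NNNA) decomposes canonically to U+0928 + U+093C.
sample = "\u0929"
report = normalization_report(sample)
print(report)
```

Whichever form is chosen, applying the same `prepare` step at training time and at benchmark time keeps token counts comparable between runs.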

Subtle point: a Nepali-specific tokenizer is not automatically better for every model. It is better only if its vocabulary, normalization, and special-token policy match the model you train or adapt. Once a model has learned embeddings for one vocabulary, you cannot swap in a different tokenizer without also changing and retraining the affected embedding/output layers.
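A toy illustration of why the tokenizer cannot simply be swapped: embedding rows are indexed by token ID, and two vocabularies assign different IDs to the same strings. Everything below (vocabularies, vectors, the `encode` helper) is invented for the example and works at the whole-word level to keep it short.

```python
def encode(words: list[str], vocab: dict[str, int], unk_id: int = 0) -> list[int]:
    """Map each word to its ID in the given vocabulary."""
    return [vocab.get(w, unk_id) for w in words]


# Two toy vocabularies that contain the same words under different IDs.
vocab_a = {"[UNK]": 0, "तिमी": 1, "भात": 2, "खान्छौ": 3}
vocab_b = {"[UNK]": 0, "भात": 1, "खान्छौ": 2, "तिमी": 3, "पानी": 4}

# Toy embedding table learned with vocab_a: one row per token ID.
embeddings_a = [[0.0, 0.0], [0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]

words = ["तिमी", "भात", "खान्छौ"]
ids_a = encode(words, vocab_a)
ids_b = encode(words, vocab_b)

rows_intended = [embeddings_a[i] for i in ids_a]
rows_after_swap = [embeddings_a[i] for i in ids_b]  # same table, new IDs

print(ids_a, ids_b)
print(rows_intended == rows_after_swap)  # the lookups no longer agree
print(len(vocab_b) > len(embeddings_a))  # the new vocab also needs more rows
```

In Hugging Face `transformers`, `model.resize_token_embeddings(len(tokenizer))` changes the size of the table, but resized or remapped rows still have to be trained before they carry meaning.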

Leaderboard: overall metrics (↓ lower is better, ↑ higher is better)

Rank	Tokenizer	Total tokens	Fertility ↓	Chars/token ↑	Avg tok/sent	Max tok/sent	Speed (tok/s) ↑
1	MuRIL	154	1.23	3.9	8.6	18	78796
2	Nepali WP	183	1.46	3.28	10.2	21	124887
3	Nepali SP	192	1.54	3.12	10.7	23	180428
4	IndicBART	249	1.99	2.41	13.8	31	71172
5	IndicBERT	253	2.02	2.37	14.1	29	107245
6	GPT-4 / GPT-3.5 (cl100k)	727	5.82	0.83	40.4	87	466440
7	GPT-2	1136	9.09	0.53	63.1	147	450774
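The leaderboard columns can be recomputed from per-sentence token counts. The sketch below assumes that fertility is total tokens divided by whitespace-separated words and that chars/token is total characters divided by total tokens. The exact word and character counting behind the table is not stated, so both definitions are assumptions. `encode` is any callable that returns a sequence of tokens; the stand-in used here emits one token per UTF-8 byte, and the sentences are illustrative rather than the benchmark corpus.

```python
import time
from typing import Callable, Sequence


def corpus_metrics(encode: Callable[[str], Sequence], sentences: list[str]) -> dict:
    """Aggregate token statistics for one tokenizer over a list of sentences."""
    start = time.perf_counter()
    counts = [len(encode(s)) for s in sentences]
    elapsed = time.perf_counter() - start

    total_tokens = sum(counts)
    total_words = sum(len(s.split()) for s in sentences)  # assumption: whitespace words
    total_chars = sum(len(s) for s in sentences)          # assumption: code points, spaces included

    return {
        "total_tokens": total_tokens,
        "fertility": total_tokens / total_words,
        "chars_per_token": total_chars / total_tokens,
        "avg_tok_per_sent": total_tokens / len(sentences),
        "max_tok_per_sent": max(counts),
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def byte_encode(text: str) -> list[int]:
    """Stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# Illustrative sentences, not the benchmark corpus.
sentences = ["तिमी भात खान्छौ", "ab cd"]
m = corpus_metrics(byte_encode, sentences)
print(m)
```

To use a real tokenizer, pass its encode method, for example `tiktoken.get_encoding("gpt2").encode`. As a consistency check on the table itself, MuRIL's 154 total tokens over the 18 corpus rows gives 8.56 tokens per sentence, matching the reported 8.6. A single timed pass is noisy and hardware-dependent, so speed figures are best averaged over repeated runs.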

NSL: token count relative to the Nepali WordPiece baseline (↓ lower is shorter)

Tokenizer	NSL avg ↓	NSL std	Interpretation
MuRIL	0.82	0.11	Shorter than baseline
Nepali WP	1.00	0.00	Baseline
Nepali SP	1.04	0.08	Close to baseline
IndicBART	1.25	0.37	Moderate overhead
IndicBERT	1.28	0.38	Moderate overhead
GPT-4 / GPT-3.5 (cl100k)	3.63	1.21	High overhead
GPT-2	5.65	1.97	Very high overhead
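NSL is read here as the per-sentence ratio of a tokenizer's token count to the Nepali WordPiece count, averaged over the corpus. That reading is an inference from the table rather than a stated definition. Applied to the MuRIL and Nepali WP per-row counts from the score sheet, it reproduces the reported MuRIL average. Whether the reported standard deviation is the population or the sample form is not stated; the sketch uses the population form.

```python
from statistics import mean, pstdev

# Per-row token counts copied from the score sheet (18 corpus rows, same order).
muril = [7, 6, 5, 5, 6, 5, 5, 5, 6, 5, 5, 11, 12, 9, 18, 14, 16, 14]
nepali_wp = [7, 7, 7, 7, 7, 7, 7, 7, 8, 7, 7, 13, 13, 9, 21, 14, 19, 16]


def nsl(counts: list[int], baseline: list[int]) -> tuple[float, float]:
    """Mean and population standard deviation of per-sentence count ratios."""
    if len(counts) != len(baseline):
        raise ValueError("counts and baseline must cover the same sentences")
    ratios = [c / b for c, b in zip(counts, baseline)]
    return mean(ratios), pstdev(ratios)


avg, std = nsl(muril, nepali_wp)
base_avg, base_std = nsl(nepali_wp, nepali_wp)

print(f"MuRIL     NSL avg={avg:.2f} std={std:.2f}")
print(f"Nepali WP NSL avg={base_avg:.2f} std={base_std:.2f}")
```

Averaging per-sentence ratios weights every row equally. Dividing corpus totals instead (154 / 183 ≈ 0.84 for MuRIL) gives long rows more influence, so the two figures need not agree.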

Score sheet: tokens per corpus row

Corpus row	GPT-2	cl100k	IndicBERT	MuRIL	IndicBART	Nepali WP	Nepali SP
Honor • Low (तँ) — You eat rice	14	11	8	7	7	7	7
Honor • Medium (तिमी) — You eat rice	15	11	7	6	6	7	7
Honor • High (तपाईं) — You eat rice	19	15	9	5	8	7	7
Honor • Reverential (हजुर) — You eat rice	18	15	8	5	7	7	7
Honor • Low (तँ) — Where are you going?	14	11	6	6	6	7	7
Honor • Medium (तिमी) — Where are you going?	15	11	5	5	5	7	7
Honor • High (तपाईं) — Where are you going?	19	15	7	5	7	7	7
Honor • Reverential (हजुर) — Where are you going?	18	15	6	5	6	7	7
Simple • Greeting	22	18	10	6	11	8	10
Simple • Short sentence	14	11	7	5	7	7	7
Simple • Weather	15	13	7	5	7	7	7
News • Education policy	49	38	19	11	20	13	13
News • Traffic	63	48	26	12	24	13	13
News • Parliament	44	36	17	9	16	9	11
Tech • AI/ML	67	51	29	18	29	21	23
Tech • Digital economy	68	54	27	14	26	14	15
Literary • Himalayas	77	59	29	16	31	19	21
Literary • Success	62	53	26	14	26	16	16

Metric glossary (plain but precise)

Fertility: tokens per word; lower is better.
Chars per token: characters per token, a proxy for information per token; higher is better.
NSL: token count relative to the Nepali WordPiece baseline; lower means shorter sequences.
Speed: tokens per second; higher is better.

Evaluation corpus (Nepali)

18 rows in five groups:
Honor: “You eat rice” and “Where are you going?”, each at four honorific levels: Low (तँ), Medium (तिमी), High (तपाईं), Reverential (हजुर)
Simple: Greeting, Short sentence, Weather
News: Education policy, Traffic, Parliament
Tech: AI/ML, Digital economy
Literary: Himalayas, Success

Overall metrics

Leaderboard, NSL (tokens vs Nepali WordPiece), and score sheet: see the tables above.

Charts

[Charts not reproduced: Fertility Score, Chars per Token, NSL Score, and Speed by tokenizer.]

Honorific register analysis

[Charts not reproduced: token count by honorific level for “You eat rice” and “Where are you going?”.]