Tokenizers: from text to numbers

A tokenizer is the boundary between human text and a language model’s vocabulary. It normalizes and splits raw Unicode text into tokens, then maps those tokens to integer IDs. The transformer itself does not reason over characters directly: those IDs are looked up in an embedding table, processed as vectors, and later decoded back through the same vocabulary.

⚛️ Tokens ≈ Atoms

Discrete building blocks. Like atoms, tokens are the discrete units the model can address by number. Every input and output is a flat sequence of integer IDs, one per token.

  • The difference: Physical atoms are discovered; BPE tokens are learned from corpus statistics.
  • The "periodic table" changes completely if you change the training corpus or target language.

🔢 Tokens ≈ Primes

Irreducible basis. Every integer greater than 1 factors into primes. The model's job is to learn the "grammar of token combinations" just as number theory studies prime combinations.

  • Rarity and unevenness: Small primes do heavy lifting. Similarly, a small number of common tokens (like " the") carry enormous frequency weight.
  • Composition is the whole game: The interesting structure comes from how tokens combine into meaning.
Start here

Text becomes integer IDs

The tokenizer splits raw text into pieces, then maps each piece to an integer. The model never sees letters — only those numbers.

"नमस्ते" नमस्ते 4821
Then compare

Fewer tokens = cheaper, longer context

Context windows and API costs are counted in tokens. A Nepali-aware tokenizer packs the same text into far fewer tokens than a generic one.

Compact · 1
Fragmented · 5
Visual intuition

The same word, easy or expensive

A tokenizer trained on Nepali text may keep नमस्ते as one unit. A generic tokenizer with no Nepali in its training corpus may split the same word into five small byte-level fragments: five IDs, five embedding lookups, five attention positions.

Nepali-aware · 1 token: नमस्ते
Generic · 5 tokens (one fragment shown): स्
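
One way to see why a byte-level tokenizer has so much to chop up is to count what the word is made of. The sketch below uses only the standard library and writes नमस्ते with explicit escapes so the code points are unambiguous; it does not run any real tokenizer.

```python
import unicodedata

# "नमस्ते" written with explicit escapes so the code points are unambiguous.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

utf8_bytes = word.encode("utf-8")

print(len(word), "code points")
for ch in word:
    print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
print(len(utf8_bytes), "UTF-8 bytes:", utf8_bytes.hex(" "))
```

Each Devanagari code point takes three bytes in UTF-8, so a vocabulary with few Devanagari merges starts from many more base units than it would for an English word of similar length.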

The atom lab: see tokens appear

Pick a sample and press Tokenize! Each highlighted chunk is one token — one vocabulary entry the model can address by ID. This demo uses hand-drawn splits to teach the idea. For real tokenizer output on Nepali corpus sentences, use the Token visualization just below.

Where the tokenizer sits in an LLM

Encoding happens before the transformer reads anything, and decoding happens after the transformer predicts output IDs. The tokenizer is not the model’s reasoning engine, but it sets the alphabet of possible moves. Change the tokenizer after training and the learned embedding rows no longer mean what the model expects.

  1. Encode input: Text is normalized, split into tokens, and converted to integer IDs.
  2. Embed IDs: Each ID selects a learned vector from the embedding table.
  3. Run transformer: Attention and feed-forward layers process the vector sequence.
  4. Predict next ID: The model outputs scores over the tokenizer’s vocabulary.
  5. Decode text: Sampled or chosen IDs are mapped back to token strings and joined.
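
The encode, embed, and decode steps can be sketched with a toy vocabulary. Everything here is made up for illustration: the four-entry vocabulary, the two-column embedding table, and the greedy longest-match splitter (which is not real BPE). Steps 3 and 4, the transformer itself, are left out.

```python
# Toy vocabulary: every name and number here is made up for illustration.
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3}
id_to_token = {i: t for t, i in vocab.items()}

# One small "learned" vector per ID (real tables have far more columns).
embedding_table = [
    [0.1, 0.3],
    [0.0, -0.2],
    [0.7, 0.5],
    [-0.4, 0.9],
]

def encode(text):
    """Greedy longest-match split against the toy vocabulary (not real BPE)."""
    ids, pos = [], 0
    pieces = sorted(vocab, key=len, reverse=True)
    while pos < len(text):
        match = next((p for p in pieces if text.startswith(p, pos)), None)
        if match is None:
            raise ValueError(f"no vocabulary entry covers {text[pos:]!r}")
        ids.append(vocab[match])
        pos += len(match)
    return ids

def embed(ids):
    """Step 2: each ID selects one row of the embedding table."""
    return [embedding_table[i] for i in ids]

def decode(ids):
    """Step 5: map IDs back to token strings and join them."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("Hello, world!")
print(ids)
print(embed(ids))
print(decode(ids))
```

The round trip only works because `encode` and `decode` share one vocabulary, which is the reason a trained model is tied to its tokenizer.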

Sample

Raw text

Hello, world!

Why does it say Ġworld — where did the space go?

GPT-family tokenizers (byte-level BPE) do not emit a standalone space token followed by world. Instead, the space before a word is merged into the first piece of that word during training. GPT-2-style vocabulary files and the tools that print them show this merged-in space as the Unicode character Ġ (U+0120), because every byte is mapped to a printable character for display. So Ġworld is one token representing space + world: the space has not vanished, it is encoded inside.

Input chars
H e l l o , SPACE w o r l d !
Tokens
Hello , Ġworld !

The blue SPACE character is absorbed into the highlighted Ġworld token. No information is lost — Ġ is just the display convention that says "this token starts a new word after whitespace."

If you write Hello,  world ! with extra spaces or different punctuation, the tokenizer may split differently — additional space tokens may appear, or punctuation may shift boundaries. That is expected: BPE merge rules were learned from specific whitespace patterns in the training corpus.
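
The Ġ convention comes from a byte-to-character table. The sketch below re-implements the mapping scheme published with GPT-2's encoder (printable bytes map to themselves, the rest are shifted to U+0100 and above); the function and variable names are my own, and no vocabulary or merge table is loaded.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable character.

    Printable Latin-1 bytes keep their own character; all other bytes
    are assigned characters starting at U+0100, in byte order.
    """
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in keep}
    shifted = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shifted)
            shifted += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def display_form(text):
    """Show text the way a byte-level vocabulary file would print it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(display_form(" world"))   # the space byte 0x20 becomes U+0120
print(display_form("\u0928"))   # Devanagari NA is three UTF-8 bytes
```

The second call shows why byte-level pieces of Devanagari look like strings of accented Latin characters rather than recognizable letters.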

Why “how many atoms?” matters

  • Cost & speed: Transformer compute and memory grow with sequence length. More tokens for the same text usually means more work, more latency, and higher bills when systems charge per token.
  • Context window: Context windows are counted in tokens, not words or characters. If Devanagari is chopped into many tiny pieces, the same token budget holds less Nepali content.
  • Vocabulary fit: A tokenizer trained with Nepali in the mix can learn frequent Devanagari words, stems, suffixes, and punctuation patterns. The goal is not always one word per token; it is stable, reusable pieces that do not waste sequence length.
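
The context-window point reduces to simple arithmetic. In this sketch the 4096-token budget is a hypothetical figure chosen for illustration, and the two tokens-per-word rates are the MuRIL and GPT-2 fertility values from the leaderboard further down the page.

```python
def words_that_fit(token_budget, fertility):
    """Approximate whole words that fit in a budget at a tokens-per-word rate."""
    return int(token_budget // fertility)

BUDGET = 4096  # hypothetical context size, for illustration only

for name, fertility in [("MuRIL", 1.23), ("GPT-2", 9.09)]:
    print(name, words_that_fit(BUDGET, fertility), "words")
```

The estimate ignores special tokens and assumes the corpus-level fertility holds for new text, so it is a rough guide rather than a guarantee.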

Token visualization

Interactive: choose a corpus sentence, then turn tokenizers on or off to compare side by side. These are real splits from seven benchmark tokenizers — the same sentences listed in the corpus section further down.

Look for boundary conventions: ## marks WordPiece continuations, ▁ (U+2581) marks a SentencePiece word boundary, and special tokens such as [CLS], [SEP], <cls>, or <sep> are real sequence pieces for those setups. Byte-level BPE pieces may look like character fragments because Devanagari code points are represented through byte-derived merges.
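
The two boundary markers can be read mechanically. This is a small sketch with hypothetical piece lists (not output from any of the seven tokenizers) showing how ## and ▁ let you rebuild word boundaries from subword pieces.

```python
def join_wordpiece(pieces):
    """Rebuild words from WordPiece output, where '##' marks a continuation."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words

def join_sentencepiece(pieces, marker="\u2581"):
    """Rebuild words from SentencePiece output, where U+2581 marks a word start."""
    return "".join(pieces).replace(marker, " ").split()

# Hypothetical pieces, chosen only to show the two conventions side by side.
print(join_wordpiece(["un", "##believ", "##able", "story"]))
print(join_sentencepiece(["\u2581un", "believ", "able", "\u2581story"]))
```

WordPiece marks the pieces that continue a word, while SentencePiece marks the pieces that start one, so the same split looks different in the two displays.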

Tokenizers

Recommended path: try the atom lab, then play with real tokenizer splits. When you want to understand the numbers, read metrics → models → corpus → results → charts → honorifics → score sheet.

Use these cards as a guided route. The sticky bar appears while you scroll and keeps the same order.

  1. Token visualization: Pick a Nepali sentence and compare how seven real tokenizers split it; start here to play.
  2. Metrics: What fertility, chars/token, NSL, and speed mean before you read numbers.
  3. Tokenizers: Who is GPT, Indic, or Nepali-tuned, and how to load each one.
  4. Corpus: The exact Nepali sentences every model encodes (honorifics first).
  5. Tables: Leaderboard and NSL vs Nepali WordPiece, the aggregate story.
  6. Bar charts: Same metrics, easier to scan at a glance.
  7. Honorifics: How respect level changes token count; toggle lines to compare.
  8. Score sheet: Per-sentence token counts aligned with the visualization at the top.

Metric glossary (plain but precise)

These metrics describe the tokenizer output, not the intelligence of the model using it. They are only comparable when the same text, normalization, word counting, and special-token policy are used for every tokenizer.

Fertility

Average number of tokens produced for each counted word.

fertility = tokens ÷ words. Lower usually means shorter model sequences.

Good (Fertility = 1.0): "apple" → apple (1 word packed into 1 token box)
Poor (Fertility = 3.0): "unbelievable" → un believ able (1 word packed into 3 token boxes)

Think of fertility as "boxes per item". You want to pack your words into as few token boxes as possible.

Caveat: 1.0 is not automatically best. Good subword suffixes can help; special tokens like [CLS]/[SEP] can also raise counts.
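
A minimal sketch of the fertility calculation, assuming words are counted by splitting on whitespace. That counting rule is my assumption; the benchmark's exact word-counting policy is not stated here, and a different rule changes the number.

```python
def fertility(token_count, text):
    """Tokens per whitespace-separated word (the word rule is an assumption)."""
    words = text.split()
    if not words:
        raise ValueError("text has no words")
    return token_count / len(words)

print(fertility(1, "apple"))          # one token for one word
print(fertility(3, "unbelievable"))   # un + believ + able
```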

Chars / token

Average number of Unicode characters packed into each token.

chars/token = characters ÷ tokens. Higher often means less fragmentation.

🚌 High (5.0 chars/token): H e l l o as 1 token carrying 5 characters
🚗🚗🚗 Low (1.0 chars/token): H e l l o as 5 tokens carrying 1 character each

Think of tokens as vehicles. A bus (high chars/token) carries many characters efficiently. Single-occupant cars (low chars/token) cause traffic jams in the model.

Caveat: very large tokens are not always better. A coarse token can hide suffixes, morphology, or spacing distinctions the model may need.
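
A sketch of the chars/token calculation. Note that Python's `len` counts Unicode code points, not visible glyphs, so a Devanagari conjunct counts as several characters; the `count_spaces` switch is a hypothetical parameter standing in for whatever character-counting policy a benchmark fixes.

```python
def chars_per_token(text, token_count, count_spaces=True):
    """Unicode code points per token; whether spaces count is a policy choice."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    chars = len(text) if count_spaces else len(text.replace(" ", ""))
    return chars / token_count

print(chars_per_token("Hello", 1))   # one bus carrying five characters
print(chars_per_token("Hello", 5))   # five cars carrying one each
```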

NSL (normalized sequence length)

Sequence length compared to a baseline tokenizer for the same sentence.

NSL = tokenizer tokens ÷ baseline tokens. Here the baseline is Nepali WordPiece = 1.0.

Standard: fits in 100 boxes (1.0x)
Vacuum: fits in 80 boxes (0.8x)
Messy: needs 250 boxes (2.5x)

Think of NSL as packing efficiency compared to a baseline. 0.8x means 20% fewer tokens than the baseline for the same text, while 2.5x takes up 2.5 times as much space.

Caveat: NSL is relative. Changing the baseline changes the number, so use it to compare tokenizers within the same experiment.
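
A sketch of NSL as a per-sentence ratio, using the box counts from the analogy above as stand-in token counts. Averaging the ratios with `mean` is one possible way to aggregate, shown here as an assumption rather than a fixed rule.

```python
from statistics import mean

def nsl(tokenizer_counts, baseline_counts):
    """Per-sentence ratios of tokenizer length to baseline length."""
    if len(tokenizer_counts) != len(baseline_counts):
        raise ValueError("both lists must cover the same sentences")
    return [t / b for t, b in zip(tokenizer_counts, baseline_counts)]

baseline = [100, 100]              # "Standard" boxes for two sentences
ratios = nsl([80, 250], baseline)  # "Vacuum" and "Messy"

print(ratios)
print(mean(ratios))
```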

Speed

How many tokens per second the tokenizer implementation processed in this benchmark run.

speed = tokens processed ÷ seconds. This measures preprocessing throughput, not model reasoning.

🚶 Human reading: ~250 words / min
🚀 Computer scanning: ~500,000 words / min

Think of speed as a factory conveyor belt. A slow tokenizer is a bottleneck, delaying the powerful AI model waiting for the text to be converted to numbers.

Caveat: speed depends on implementation details: hardware, batching, caching, Python vs Rust backend, and vocabulary size.
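
A sketch of how a tokens-per-second figure can be measured. The whitespace splitter is only a stand-in for a real `encode` call, and the repeat count is arbitrary; the result says something about this machine and this stand-in, nothing about the seven tokenizers.

```python
import time

def toy_tokenize(text):
    """Stand-in tokenizer: whitespace split. Swap in a real encode call."""
    return text.split()

def tokens_per_second(tokenize, texts, repeats=1000):
    """Time repeated passes over texts and return tokens processed per second."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for text in texts:
            total_tokens += len(tokenize(text))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

corpus = ["आज मौसम राम्रो छ ।", "रामले भात खायो ।"]
rate = tokens_per_second(toy_tokenize, corpus)
print(f"{rate:,.0f} tokens/sec (toy tokenizer, this machine only)")
```

Because faster tokenizers that emit more tokens also score higher on this metric, tok/s is best read next to total token count rather than on its own.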

Tokenizers in scope

GPT family encodings are byte-level BPE (tiktoken). They were optimised for English-heavy web text, so Nepali in Devanagari often splits into many more pieces than it does with tokenizers from Indic-pretrained models.

GPT family

OpenAI · tiktoken

GPT-2

import tiktoken
enc = tiktoken.get_encoding("gpt2")
Mechanism
Byte-level BPE; vocabulary of 50,257 entries (256 byte tokens, 50,000 learned merges, and one end-of-text token). Historic default for GPT-2–style models.
Nepali relevance
No Indic-specific pretraining; useful as a baseline for “how many tokens does generic English-centric BPE spend on Nepali?”
Docs
openai/tiktoken
OpenAI · tiktoken

GPT-4 / GPT-3.5 family (cl100k_base)

import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
Mechanism
Byte-level BPE used by GPT-3.5 and many GPT-4-era tools. It has a larger merge table than gpt2 and better multilingual coverage in practice, but it is still not a Nepali-specific vocabulary. GPT-4o uses a newer encoding (o200k_base), so this benchmark should be read as cl100k_base, not GPT-4o itself.
Nepali relevance
Often fewer tokens than gpt2 on non-Latin text, but still not designed around Devanagari morphology.
Docs
openai/tiktoken (encoding names)

Indic checkpoints differ in whether Nepali was in the pretraining mix. MuRIL explicitly lists Nepali (ne) among its supported languages. The public IndicBERT and IndicBART cards do not expose Nepali as a primary supported language/tag in the same way, but their tokenizers still encode Devanagari strings and are useful Indic subword comparisons.

Indic family (Hugging Face)

AI4Bharat

IndicBERT

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
Model
Multilingual ALBERT-style encoder; ~9B tokens across 12 Indian languages (see model card).
Tokenizer
SentencePiece-style subword tokenizer loaded through Hugging Face AutoTokenizer; the visible ▁ marks word starts. Check the model revision and tokenizer files before comparing runs.
Card
huggingface.co/ai4bharat/indic-bert
Google

MuRIL

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("google/muril-base-cased")
Model
BERT-base architecture; 17 languages (16 Indian languages, including Nepali, plus English); uses translation and transliteration pairs in pretraining.
Tokenizer
Cased WordPiece (same broad family as multilingual BERT).
Card
huggingface.co/google/muril-base-cased
AI4Bharat

IndicBART

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART",
    do_lower_case=False,
    use_fast=False,
    keep_accents=True,
)
Model
Smaller mBART-style seq2seq model for Indic NLG; IndicCorp-scale pretraining.
Tokenizer
SentencePiece-backed vocabulary (model card notes SentencePiece rather than classic mBART BPE; often loaded via AlbertTokenizer in older examples). Language tags like <2hi> matter for training-time format.
Card
huggingface.co/ai4bharat/IndicBART

Nepali-specific (project)

Local / custom

Nepali WordPiece

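# Your pipeline, e.g. one possible loader (the vocab path is a placeholder)
from transformers import BertTokenizerFast
tok = BertTokenizerFast(vocab_file="path/to/nepali-vocab.txt", do_lower_case=False)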
Role
WordPiece over a Nepali (or mixed Nepali–English) vocabulary file — typical for BERT-style TTS or LM front-ends you train yourself.
What to record in the benchmark
Vocab path, cased vs uncased, unknown token rate on the corpus rows below.
Local / custom

Nepali SentencePiece

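# Your pipeline, e.g. one possible loader (the model path is a placeholder)
import sentencepiece as spm
tok = spm.SentencePieceProcessor(model_file="path/to/nepali.model")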
Role
Unigram or BPE SentencePiece model trained on Nepali-dominated text — common for seq2seq TTS, ASR, or compact lexicons.
What to record in the benchmark
Model path (.model), vocabulary size, and whether normalization NFC/NFD is applied before encode.

Subtle point: a Nepali-specific tokenizer is not automatically better for every model. It is better only if its vocabulary, normalization, and special-token policy match the model you train or adapt. Once a model has learned embeddings for one vocabulary, you cannot swap in a different tokenizer without also changing and retraining the affected embedding/output layers.
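
The swap problem can be shown with two toy vocabularies. Both lists below are invented for illustration; the point is only that an ID is an index, so the same IDs mean different things once the vocabulary behind them changes.

```python
# Two made-up vocabularies that assign different strings to the same IDs.
vocab_a = ["<unk>", "नमस्ते", " संसार"]
vocab_b = ["<unk>", "नम", "स्ते"]

def decode(ids, vocab):
    """Look each ID up in the given vocabulary and join the pieces."""
    return "".join(vocab[i] for i in ids)

ids_from_a = [1, 2]  # produced by the tokenizer that owns vocab_a

print(decode(ids_from_a, vocab_a))  # the intended text
print(decode(ids_from_a, vocab_b))  # same IDs, different vocabulary
```

Embedding rows are indexed the same way, so a model trained with one vocabulary would look up vectors learned for unrelated pieces if fed IDs from another.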

Evaluation corpus (Nepali)

Fixed strings for comparing honorific register, plain sentences, news-like complexity, technical vocabulary, and literary style. The point is controlled comparison: every tokenizer sees the same Unicode strings, including punctuation and special marks such as the zero-width non-joiner in संसद्‌मा.
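
Invisible characters are easy to lose when copying corpus text, so it can help to inspect code points before encoding. This standard-library sketch spells out संसद्‌मा with the zero-width non-joiner (U+200C) as an explicit escape and checks that NFC normalization keeps it.

```python
import unicodedata

# The word with the zero-width non-joiner (U+200C) written explicitly.
with_zwnj = "\u0938\u0902\u0938\u0926\u094d\u200c\u092e\u093e"
without_zwnj = with_zwnj.replace("\u200c", "")

def describe(text):
    """List code points so invisible characters become visible."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(describe(with_zwnj))
print(len(with_zwnj), len(without_zwnj))
print("\u200c" in unicodedata.normalize("NFC", with_zwnj))
```

The two strings can render almost identically yet are different inputs, so a tokenizer may split them differently; comparing code-point lists is a cheap check that every tokenizer really received the same string.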

Group 1 — Honorifics (same meaning, four respect levels)

Honor • Low (तँ) — You eat rice

तँ भात खान्छस् ।

Honor • Medium (तिमी) — You eat rice

तिमी भात खान्छौ ।

Honor • High (तपाईं) — You eat rice

तपाईं भात खानुहुन्छ ।

Honor • Reverential (हजुर) — You eat rice

हजुर भात खानुहुन्छ ।

Honor • Low (तँ) — Where are you going?

तँ कहाँ जान्छस् ?

Honor • Medium (तिमी) — Where are you going?

तिमी कहाँ जान्छौ ?

Honor • High (तपाईं) — Where are you going?

तपाईं कहाँ जानुहुन्छ ?

Honor • Reverential (हजुर) — Where are you going?

हजुर कहाँ जानुहुन्छ ?

More evaluation sentences Simple, news, technical, and literary rows — collapse this block if you only need honorifics above.

Group 2 — General Nepali

Simple • Greeting

नमस्ते, तपाईंलाई कस्तो छ ?

Simple • Short sentence

रामले भात खायो ।

Simple • Weather

आज मौसम राम्रो छ ।

Group 3 — News / complex

News • Education policy

नेपाल सरकारले नयाँ शिक्षा नीति लागू गर्ने निर्णय गरेको छ ।

News • Traffic

काठमाडौं उपत्यकामा यातायात व्यवस्थापनका लागि नयाँ योजना तयार पारिएको छ ।

News • Parliament

प्रधानमन्त्रीले संसद्‌मा नयाँ बजेट प्रस्तुत गरे ।

Group 4 — Technical

Tech • AI/ML

कृत्रिम बुद्धिमत्ता र मेसिन लर्निङले नेपाली भाषा प्रशोधनमा नयाँ आयाम थपेको छ ।

Tech • Digital economy

डिजिटल प्रविधिको विकाससँगै सूचना प्रविधि क्षेत्रमा रोजगारीका अवसर बढेका छन् ।

Group 5 — Literary

Literary • Himalayas

हिमालयको काखमा बसेको यो सुन्दर देशले विश्वलाई आफ्नो प्राकृतिक सौन्दर्यले मोहित पारेको छ ।

Literary • Success

मानिसको जीवनमा सफलता पाउन परिश्रम, धैर्य र दृढ संकल्पको आवश्यकता हुन्छ ।

Overall metrics

Aggregates from encoding the full evaluation corpus above. Rank orders by total tokens: fewer tokens means shorter model sequences for this corpus. That is useful, but not the whole story. A tokenizer also needs low unknown-token rates, stable normalization, and pieces that preserve distinctions important for the downstream model.

Leaderboard

Rank Tokenizer Total tokens Fertility ↓ Chars/token ↑ Avg tok/sent Max tok/sent Speed (tok/s) ↑
1 MuRIL 154 1.23 3.9 8.6 18 78796
2 Nepali WP 183 1.46 3.28 10.2 21 124887
3 Nepali SP 192 1.54 3.12 10.7 23 180428
4 IndicBART 249 1.99 2.41 13.8 31 71172
5 IndicBERT 253 2.02 2.37 14.1 29 107245
6 GPT-4 / GPT-3.5 (cl100k) 727 5.82 0.83 40.4 87 466440
7 GPT-2 1136 9.09 0.53 63.1 147 450774

Speed is implementation-dependent (hardware, batching, tokenizer backend); it is shown here as observed in the benchmark run, not as intrinsic model quality. Token count affects transformer cost later, while tokenizer speed measures only the text-to-ID preprocessing step.
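
The derived columns can be re-checked from the totals. This sketch divides each total by the 18 sentences in the evaluation corpus and compares the result with the reported average tokens per sentence; the dictionary names are my own.

```python
SENTENCES = 18  # rows in the evaluation corpus above

total_tokens = {
    "MuRIL": 154, "Nepali WP": 183, "Nepali SP": 192, "IndicBART": 249,
    "IndicBERT": 253, "GPT-4 / GPT-3.5 (cl100k)": 727, "GPT-2": 1136,
}
reported_avg = {
    "MuRIL": 8.6, "Nepali WP": 10.2, "Nepali SP": 10.7, "IndicBART": 13.8,
    "IndicBERT": 14.1, "GPT-4 / GPT-3.5 (cl100k)": 40.4, "GPT-2": 63.1,
}

# Recompute the average and flag any row that disagrees with the table.
recomputed = {name: round(total / SENTENCES, 1) for name, total in total_tokens.items()}
for name, avg in recomputed.items():
    print(name, avg, "matches" if avg == reported_avg[name] else "differs")
```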

NSL — tokens vs Nepali WordPiece

How long each tokenizer’s sequence is compared with the Nepali WordPiece baseline (NSL = 1.0). 1.0 = same as baseline  |  3.0 = three times as many tokens  |  0.8 = shorter than baseline.

Tokenizer NSL avg ↓ NSL std Interpretation
MuRIL 0.82 0.11 Shorter than baseline
Nepali WP 1 0 Baseline
Nepali SP 1.04 0.08 Close to baseline
IndicBART 1.25 0.37 Moderate overhead
IndicBERT 1.28 0.38 Moderate overhead
GPT-4 / GPT-3.5 (cl100k) 3.63 1.21 High overhead
GPT-2 5.65 1.97 Very high overhead

MuRIL's NSL below 1.0 means its sequences are, averaged per sentence, shorter than the Nepali WP baseline on this corpus; the leaderboard rank reflects absolute total tokens instead (154 vs 183).
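
There are two ways to aggregate a ratio, and they give slightly different numbers. The sketch below uses the per-sentence MuRIL and Nepali WP counts from the score sheet (in corpus order) and computes both the mean of per-sentence ratios and the ratio of the totals; that the table's NSL average is the former is my inference from the figures, not something the page states.

```python
from statistics import mean

# Per-sentence token counts in corpus order (18 rows), read from the score sheet.
muril = [7, 6, 5, 5, 6, 5, 5, 5, 6, 5, 5, 11, 12, 9, 18, 14, 16, 14]
nepali_wp = [7, 7, 7, 7, 7, 7, 7, 7, 8, 7, 7, 13, 13, 9, 21, 14, 19, 16]

mean_of_ratios = mean(m / b for m, b in zip(muril, nepali_wp))
ratio_of_totals = sum(muril) / sum(nepali_wp)

print(round(mean_of_ratios, 2))   # every sentence weighted equally
print(round(ratio_of_totals, 2))  # long sentences weighted more
```

The mean of ratios treats a four-word sentence and a fourteen-word sentence alike, which is why it can differ from the ratio implied by the leaderboard totals.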

Charts

Same numbers as the tables above. Each chart uses its own 0–max scale, so compare rows within one chart rather than bar lengths across different charts. For fertility and NSL, shorter is usually better; for chars/token and speed, longer is usually better.

Fertility Score (tokens per word — lower = better)

Bar length vs max in corpus · max 9.09

MuRIL 1.23
Nepali WP 1.46
Nepali SP 1.54
IndicBART 1.99
IndicBERT 2.02
GPT-4 / GPT-3.5 (cl100k) 5.82
GPT-2 9.09

Chars per Token (info per token — higher = better)

Bar length vs max in corpus · max 3.9

MuRIL 3.9
Nepali WP 3.28
Nepali SP 3.12
IndicBART 2.41
IndicBERT 2.37
GPT-4 / GPT-3.5 (cl100k) 0.83
GPT-2 0.53

NSL Score (vs Nepali WordPiece baseline — lower = shorter)

Bar length vs max in corpus · max 5.65

MuRIL 0.82
Nepali WP 1.0
Nepali SP 1.04
IndicBART 1.25
IndicBERT 1.28
GPT-4 / GPT-3.5 (cl100k) 3.63
GPT-2 5.65

Speed (tokens/sec — higher = better)

Bar length vs max in corpus · max 466,440

GPT-4 / GPT-3.5 (cl100k) 466,440
GPT-2 450,774
Nepali SP 180,428
Nepali WP 124,887
IndicBERT 107,245
MuRIL 78,796
IndicBART 71,172

Charts use the same benchmark run as the tables. Speed bars use a distinct tint to remind that raw tok/s depends on hardware and implementation.

Honorific register analysis

These pairs keep the intended meaning roughly constant while changing Nepali register: pronouns and verb morphology move from informal to respectful. A tokenizer is not measuring politeness directly; the chart shows how the changed surface forms affect token count. Vertical axis is one sentence’s token count; horizontal axis is fixed left-to-right: Low → Medium → High → Reverential.

Show tokenizer lines (both charts)

“You eat rice” — token count by honorific level

Counts match the visualization section. A flat line means the tokenizer handled these register changes with the same sequence length, not that the forms are linguistically identical.

Chart: "You eat rice", token count vs honorific level for seven tokenizers. Line chart with y-axis from 0 to 20 tokens and x-axis of four honor levels: Low (तँ), Medium (तिमी), High (तपाईं), Rev. (हजुर). Nepali WordPiece and SentencePiece are flat at seven tokens. GPT-2 peaks at high and reverential registers.

Nepali WP and SP share the same counts here (7 each); the dashed line traces SentencePiece on top of WordPiece. MuRIL’s count falls because these polite surface forms happen to match larger learned pieces.

“Where are you going?” — token count by honorific level

Same four honor levels; the question form changes pronoun and verb morphology with register. Colors match the toggles above.

Chart: "Where are you going?", token count vs honorific level for seven tokenizers. Line chart with y-axis from 0 to 20 tokens and x-axis of four honor levels: Low (तँ), Medium (तिमी), High (तपाईं), Rev. (हजुर). GPT-2 and cl100k_base show a sharp rise at high register. IndicBERT and IndicBART dip at medium then rise.

Byte-level BPE models (GPT-2 and cl100k_base) jump at High / Reverential when polite verb forms add characters. Nepali-specific tokenizers stay flat at seven tokens across registers on these two sentences.

Score sheet

Per-sentence token counts aligned with the visualization at the top of the page. Counts include special tokens when the configured tokenizer emits them, because those IDs occupy positions in the sequence. If you change normalization, special-token settings, corpus text, or tokenizer revision, regenerate this table before interpreting the aggregate metrics.

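Corpus row | G2 | cl100k | IB | MuRIL | IBrt | N-WP | N-SP
Honor • Low (तँ) — You eat rice | 14 | 11 | 8 | 7 | 7 | 7 | 7
Honor • Medium (तिमी) — You eat rice | 15 | 11 | 7 | 6 | 6 | 7 | 7
Honor • High (तपाईं) — You eat rice | 19 | 15 | 9 | 5 | 8 | 7 | 7
Honor • Reverential (हजुर) — You eat rice | 18 | 15 | 8 | 5 | 7 | 7 | 7
Honor • Low (तँ) — Where are you going? | 14 | 11 | 6 | 6 | 6 | 7 | 7
Honor • Medium (तिमी) — Where are you going? | 15 | 11 | 5 | 5 | 5 | 7 | 7
Honor • High (तपाईं) — Where are you going? | 19 | 15 | 7 | 5 | 7 | 7 | 7
Honor • Reverential (हजुर) — Where are you going? | 18 | 15 | 6 | 5 | 6 | 7 | 7
Simple • Greeting | 22 | 18 | 10 | 6 | 11 | 8 | 10
Simple • Short sentence | 14 | 11 | 7 | 5 | 7 | 7 | 7
Simple • Weather | 15 | 13 | 7 | 5 | 7 | 7 | 7
News • Education policy | 49 | 38 | 19 | 11 | 20 | 13 | 13
News • Traffic | 63 | 48 | 26 | 12 | 24 | 13 | 13
News • Parliament | 44 | 36 | 17 | 9 | 16 | 9 | 11
Tech • AI/ML | 67 | 51 | 29 | 18 | 29 | 21 | 23
Tech • Digital economy | 68 | 54 | 27 | 14 | 26 | 14 | 15
Literary • Himalayas | 77 | 59 | 29 | 16 | 31 | 19 | 21
Literary • Success | 62 | 53 | 26 | 14 | 26 | 16 | 16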

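Column key: G2 = GPT-2, cl100k = GPT-4 / GPT-3.5 (cl100k_base), IB = IndicBERT, IBrt = IndicBART, N-WP = Nepali WordPiece, N-SP = Nepali SentencePiece. The IB, MuRIL, IBrt, N-WP, and N-SP columns sum to the totals in overall metrics (same preprocessing). The G2 and cl100k columns as shown sum to 613 and 485, not the leaderboard's 1136 and 727, so regenerate those two columns and the leaderboard from the same run before comparing them. Refresh the numbers when your benchmark run or tokenizer setup changes.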
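
The column totals can be re-summed and compared with the leaderboard. This sketch transcribes the score sheet rows in order; matching each abbreviated column to a leaderboard tokenizer (for example IB to IndicBERT and IBrt to IndicBART) is my reading of the headers.

```python
columns = ["G2", "cl100k", "IB", "MuRIL", "IBrt", "N-WP", "N-SP"]

# Score sheet rows in corpus order, one list per sentence.
rows = [
    [14, 11, 8, 7, 7, 7, 7], [15, 11, 7, 6, 6, 7, 7],
    [19, 15, 9, 5, 8, 7, 7], [18, 15, 8, 5, 7, 7, 7],
    [14, 11, 6, 6, 6, 7, 7], [15, 11, 5, 5, 5, 7, 7],
    [19, 15, 7, 5, 7, 7, 7], [18, 15, 6, 5, 6, 7, 7],
    [22, 18, 10, 6, 11, 8, 10], [14, 11, 7, 5, 7, 7, 7],
    [15, 13, 7, 5, 7, 7, 7], [49, 38, 19, 11, 20, 13, 13],
    [63, 48, 26, 12, 24, 13, 13], [44, 36, 17, 9, 16, 9, 11],
    [67, 51, 29, 18, 29, 21, 23], [68, 54, 27, 14, 26, 14, 15],
    [77, 59, 29, 16, 31, 19, 21], [62, 53, 26, 14, 26, 16, 16],
]

leaderboard_totals = {
    "G2": 1136, "cl100k": 727, "IB": 253, "MuRIL": 154,
    "IBrt": 249, "N-WP": 183, "N-SP": 192,
}

# Sum each column and report whether it agrees with the leaderboard.
column_sums = {name: sum(row[i] for row in rows) for i, name in enumerate(columns)}
for name in columns:
    status = "matches" if column_sums[name] == leaderboard_totals[name] else "differs"
    print(name, column_sums[name], leaderboard_totals[name], status)
```

Running a check like this after every benchmark refresh is a cheap way to catch a table that was regenerated without its companion.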