Tokenization

Why models see tokens — not characters or words — and how that shapes cost, context limits, and a model's odd blind spots.

A language model never sees your text the way you do. Before it can predict anything, your words are cut into tokens — chunks that are often whole words, but just as often fragments, punctuation, or a single space stuck to the front of a word. The model only ever reads and writes these tokens; characters and words are a human convenience layered on top.

The clearest way to understand this is to watch it happen. Type anything below and see exactly how today's models (this uses GPT-4o's tokenizer) carve it up.

Try it · live · GPT-4o tokenizer (o200k_base)

Each coloured chip is one token. Hover a chip to see its position and numeric id. The same number of characters can produce very different token counts, and the token count is exactly what you pay for and what fills the context window.

Three things to notice

Common words are cheap; rare ones aren't. “the”, “model”, and “token” are each a single token. But a long or unusual word — try supercalifragilisticexpialidocious — shatters into many pieces, because the tokenizer only learned merges for sequences it saw often during training.
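To make "learned merges" concrete, here is a toy byte-pair-encoding-style trainer in plain Python. It is only a sketch: real tokenizers such as o200k_base work on bytes, train on enormous corpora, and have far larger vocabularies. The function names (`learn_merges`, `apply_merge`, `tokenize`) and the tiny corpus are made up for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges from a toy corpus given as a list of words."""
    # Every word starts out as a sequence of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # The most frequent pair becomes a new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in the order they were learned, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# "the" is frequent in this toy corpus; "zebra" appears once.
merges = learn_merges(["the"] * 10 + ["zebra"], num_merges=2)
print(tokenize("the", merges))    # ['the']
print(tokenize("zebra", merges))  # ['z', 'e', 'b', 'r', 'a']
```

With only two merges available, both go to the frequent word, so "the" collapses into one token while the rare word stays in single-character pieces.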

A space is part of the token. Notice that “ tokenization” (with a leading space) is usually a different token from “tokenization” at the very start of the text. That little dot in the chips marks where a space lives — models think in “␣word” units far more than in bare words.
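One way to see why the space travels with the word is to look at the pre-splitting step that many GPT-style tokenizers run before any merging. The pattern below is a simplified, ASCII-only approximation loosely modelled on that idea, not the actual o200k_base pattern, and `pre_tokenize` is a hypothetical helper name.

```python
import re

# Simplified assumption: an optional single leading space is captured together
# with the run of letters, digits, or punctuation that follows it.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text):
    """Split text into chunks, keeping a leading space attached to its word."""
    return PRETOKENIZE.findall(text)


pieces = pre_tokenize("tokenization is not tokenization")
print(pieces)  # ['tokenization', ' is', ' not', ' tokenization']
```

The first and last chunks spell the same word but are different strings, so a tokenizer working on these chunks has no reason to give them the same id.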

Numbers and code are weird. Try a big number like 1,000,000 or a date: they rarely split into neat digits, which is a big part of why models historically fumbled arithmetic. Code, whitespace, and non-English scripts also tend to need more tokens per character than plain English prose.
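A rough sketch of the number problem, assuming a pre-tokenizer that groups digits into runs of up to three, reading left to right (some GPT-style tokenizers behave this way; the exact tokens a given model produces may differ). `chunk_digits` is a hypothetical helper for illustration.

```python
import re


def chunk_digits(number_text):
    """Group digits into runs of up to three, left to right; other characters stand alone."""
    return re.findall(r"[0-9]{1,3}|[^0-9]", number_text)


for text in ["1000000", "1,000,000", "12345"]:
    print(text, "->", chunk_digits(text))
# 1000000 -> ['100', '000', '0']
# 1,000,000 -> ['1', ',', '000', ',', '000']
# 12345 -> ['123', '45']
```

Under this assumption the chunks for 1000000 do not line up with thousands, and the same quantity written with commas splits completely differently, so the model has to learn place value across inconsistent pieces.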

Why this matters in practice

Tokens are the unit of almost everything you care about. You're billed per token, in and out. A model's context window — how much it can “hold in mind” at once — is measured in tokens, not characters, so the chars-per-token ratio above directly decides how much of a document actually fits. And because the model reasons over tokens, quirks like the arithmetic stumbles above fall straight out of how text got split in the first place.

A useful rule of thumb for English prose: roughly 4 characters per token, so one token is about ¾ of a word. But as you can see by pasting your own text, the only way to really know is to tokenize it.
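For quick planning, the rule of thumb can be turned into a back-of-the-envelope estimator. This is a sketch under loud assumptions: the 4-characters-per-token ratio only holds roughly for English prose, and the context window size and per-million-token prices below are invented placeholders, not any provider's real numbers.

```python
import math

CHARS_PER_TOKEN = 4.0  # rough ratio for English prose; code and other languages differ


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Estimate the token count from the character count alone."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(text, context_window, reserved_for_output=0):
    """Check whether the estimated prompt plus reserved output tokens fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


def estimate_cost(input_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Estimate cost when input and output tokens are billed at separate rates."""
    total = input_tokens * price_in_per_million + output_tokens * price_out_per_million
    return total / 1_000_000


document = "word " * 2000                 # 10,000 characters of filler text
tokens_in = estimate_tokens(document)
print(tokens_in)                          # 2500
print(fits_in_context(document, context_window=8192, reserved_for_output=1000))  # True
# Hypothetical prices: 2.00 per million input tokens, 8.00 per million output tokens.
print(estimate_cost(tokens_in, 500, 2.0, 8.0))
```

Treat the result as a budget estimate only; for anything that matters, run the text through the model's actual tokenizer and count.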



© 2026 Rishabh Mehan · All rights reserved · Built with Next.js and a little stubbornness.