Tokenization
Why models see tokens — not characters or words — and how that shapes cost, context limits, and a model's odd blind spots.
A language model never sees your text the way you do. Before it can predict anything, your words are cut into tokens — chunks that are often whole words, but just as often fragments, punctuation, or a single space stuck to the front of a word. The model only ever reads and writes these tokens; characters and words are a human convenience layered on top.
The clearest way to understand this is to watch it happen. Type anything below and see exactly how today's models (this uses GPT-4o's tokenizer) carve it up.
Each coloured chip is one token. Hover a chip to see its position and numeric id. The same number of characters can produce very different token counts, and the token count is exactly what you pay for and what fills the context window.
Three things to notice
Common words are cheap; rare ones aren't. “the”, “model”, and “token” are each a single token. But a long or unusual word — try supercalifragilisticexpialidocious — shatters into many pieces, because the tokenizer only learned merges for sequences it saw often during training.
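To see why frequency matters, here is a minimal sketch of byte-pair-encoding-style merge learning on a toy corpus. It is an illustration only: real tokenizers such as GPT-4o's work on bytes with a far larger vocabulary, and the corpus, function names, and merge count here are all made up for the example.

```python
from collections import Counter


def apply_merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Repeatedly merge the most frequent adjacent pair in the corpus."""
    # Every word starts out split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[tuple(apply_merge(list(symbols), best))] += freq
        corpus = merged_corpus
    return merges


def tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then replay the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Toy corpus (an assumption): three common words, nothing else.
toy_corpus = ["the"] * 50 + ["model"] * 20 + ["token"] * 20
merges = learn_merges(toy_corpus, num_merges=10)

print(tokenize("the", merges))        # ['the']
print(tokenize("tokenizer", merges))  # ['token', 'i', 'z', 'e', 'r']
print(tokenize("zebra", merges))      # ['z', 'e', 'b', 'r', 'a']
```

The words seen often collapse into one token each, while a word the toy tokenizer never saw falls back to single characters.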
A space is part of the token. Notice that “ tokenization” (with a leading space) is usually a different token from “tokenization” at the very start of the text. That little dot in the chips marks where a space lives — models think in “␣word” units far more than in bare words.
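One way to picture the “␣word” behaviour is the pre-splitting step that GPT-style tokenizers run before any merging. The sketch below uses a simplified, ASCII-only regular expression loosely modelled on that step. The pattern and the `pretokenize` name are assumptions for illustration, not the pattern GPT-4o actually uses.

```python
import re

# Simplified, ASCII-only pattern (an assumption): an optional leading space
# stays glued to the word, number, or punctuation run that follows it.
PRETOKEN_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pretokenize(text: str) -> list[str]:
    """Cut text into chunks; each chunk is later tokenized on its own."""
    return PRETOKEN_PATTERN.findall(text)


chunks = pretokenize("tokenization is tokenization")
# Show spaces as a visible marker, like the dot in the chips.
print([chunk.replace(" ", "␣") for chunk in chunks])
# ['tokenization', '␣is', '␣tokenization']
```

Because the first and last chunks are different strings, a tokenizer built on top of a split like this can easily assign them different ids.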
Numbers and code are weird. Try a big number like 1,000,000 or a date — they rarely split into neat digits, which is a big part of why models historically fumbled arithmetic. Code, whitespace, and non-English scripts all tokenize less efficiently too.
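As a rough illustration of why digit splitting can hurt arithmetic, suppose a tokenizer cuts a run of digits left to right into chunks of up to three. That rule is an assumption for this sketch, and real tokenizers differ in the details. The chunks then fail to line up with the place-value groups a person would use.

```python
import re


def chunk_digits(digits: str, max_len: int = 3) -> list[str]:
    """Split a digit run left to right into chunks of at most max_len digits."""
    return re.findall(r"[0-9]{1,%d}" % max_len, digits)


def place_value_groups(digits: str) -> list[str]:
    """Group the same digits from the right, the way thousands separators do."""
    return f"{int(digits):,}".split(",")


print(chunk_digits("1234567"))        # ['123', '456', '7']
print(place_value_groups("1234567"))  # ['1', '234', '567']
```

Under this assumed rule, the model sees “123” where a person sees “1 million, 234 thousand”, so the same chunk means different magnitudes depending on how long the number is.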
Why this matters in practice
Tokens are the unit of almost everything you care about. You're billed per token, in and out. A model's context window — how much it can “hold in mind” at once — is measured in tokens, not characters, so the chars-per-token ratio above directly decides how much of a document actually fits. And because the model reasons over tokens, quirks like the arithmetic stumbles above fall straight out of how text got split in the first place.
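The arithmetic behind billing and context limits is simple enough to sketch. The prices and the 128,000-token window below are hypothetical placeholders, not any provider's real figures, and the function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # Hypothetical prices in dollars per million tokens.
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request: input and output tokens are billed separately."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def chars_that_fit(context_tokens: int, chars_per_token: float) -> int:
    """How many characters fit in a context window at a given ratio."""
    return int(context_tokens * chars_per_token)


pricing = Pricing(input_per_million=2.0, output_per_million=8.0)
print(request_cost(10_000, 1_000, pricing))  # 0.028

# Same hypothetical window, two different chars-per-token ratios.
print(chars_that_fit(128_000, 4.0))  # 512000
print(chars_that_fit(128_000, 2.0))  # 256000
```

Halving the chars-per-token ratio, as can happen with code or non-English text, halves how much of a document fits in the same window.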
A useful rule of thumb for English prose: one token is roughly 4 characters, or about ¾ of a word. But as you can see by pasting your own text, the only way to really know is to tokenize it.
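If you only need a ballpark figure, the rule of thumb can be turned into a quick estimator. This is a rough sketch with hypothetical function names. The two heuristics will often disagree with each other and with a real tokenizer, which is the point.

```python
def estimate_tokens_from_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate token count from character count (about 4 chars per token)."""
    return round(len(text) / chars_per_token)


def estimate_tokens_from_words(text: str, words_per_token: float = 0.75) -> int:
    """Estimate token count from word count (a token is about 3/4 of a word)."""
    return round(len(text.split()) / words_per_token)


sample = "A language model never sees your text the way you do."
print(estimate_tokens_from_chars(sample))
print(estimate_tokens_from_words(sample))
```

Treat both numbers as estimates for English prose only. For code, numbers, or other scripts, count with the actual tokenizer.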