How models are trained

A brief intro to pretraining, fine-tuning, and RLHF — how a pile of text becomes a model that follows instructions.

A chat model doesn't spring into existence knowing how to be helpful. It's built in stages, and each stage has a very different job — different data, a different objective, and a different thing it produces. Understanding those stages explains a lot of model behaviour: why a model knows so much yet sometimes ignores your instructions, or why two models with similar knowledge feel so different to talk to.

Here is the whole training loop in miniature: text is sampled from a giant corpus, tokenized into ids, pushed through the network, scored against the tokens that actually came next, and then the error flows backwards to nudge every weight. Over a few optimizer steps the loss falls and the connection strengths change; those weights are the model.

Figure: the training loop (interactive diagram). Raw text corpus (~trillions of tokens) → batch → tokenizer (BPE, text → token ids) → embed → Block 1 … Block N → logits → next-token prediction compared against the actual next token → cross-entropy loss (9.20 in the frame shown) → gradients (∇) → optimizer (AdamW, lr 3e-4). Parameters (weights) live on every connection.

Training data is a huge cleaned, deduplicated corpus — trillions of tokens of web text, books, and code. Every step samples a fresh batch of sequences from it.
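The loop described above (sample a batch, run the forward pass, score against the actual next token, backpropagate, update) can be sketched in a few dozen lines. This is a toy under loud assumptions, not how a real model is trained: the "network" is a single table of bigram logits, the "tokenizer" is character-level rather than BPE, and the optimizer is plain SGD rather than AdamW. The function name `train`, the corpus, and the hyperparameters are all made up for illustration.

```python
import math
import random

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(text, steps=200, batch_size=16, lr=0.5, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(text))                  # toy character-level "tokenizer"
    stoi = {ch: i for i, ch in enumerate(vocab)}
    ids = [stoi[ch] for ch in text]            # text -> token ids
    V = len(vocab)
    # weights[i][j] is the logit for "token j follows token i"
    weights = [[0.0] * V for _ in range(V)]
    losses = []
    for _ in range(steps):
        # sample a fresh batch of positions from the corpus
        positions = [rng.randrange(len(ids) - 1) for _ in range(batch_size)]
        grads = [[0.0] * V for _ in range(V)]
        loss = 0.0
        for p in positions:
            x, y = ids[p], ids[p + 1]
            probs = softmax(weights[x])         # forward pass
            loss += -math.log(probs[y])         # cross-entropy vs. the actual next token
            for j in range(V):                  # gradient of the loss w.r.t. the logits
                grads[x][j] += probs[j] - (1.0 if j == y else 0.0)
        # optimizer step: plain SGD (an assumption; the diagram names AdamW)
        for i in range(V):
            for j in range(V):
                weights[i][j] -= lr * grads[i][j] / batch_size
        losses.append(loss / batch_size)
    return weights, losses

corpus = "the cat sat on the mat. the dog sat on the log. " * 20
weights, losses = train(corpus)
print(f"first loss {losses[0]:.2f}, last loss {losses[-1]:.2f}")
```

Because the weights start at zero, the first loss is exactly the log of the vocabulary size (a uniform guess), and it falls as the table picks up which characters tend to follow which.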

Zooming out, that same loop is run in three distinct training stages, each with different data, a different objective, and a different result.

Walk the pipeline: raw text → aligned assistant (stage 1, pretraining, shown)
Input data: A massive pile of internet text, books, and code; trillions of tokens, mostly unlabeled.
Objective: Predict the next token, over and over, across everything it reads.
Produces: A “base model” that has absorbed grammar, facts, and patterns, but only knows how to continue text, not how to be helpful. Ask it a question and it might reply with more questions.
In a sentence: Read the whole library until you can finish almost any sentence.

Not every model uses all three stages, and the names vary — but this pretraining → fine-tuning → preference-tuning shape is the backbone of almost every modern chat model.

1 · Pretraining: learning language

The model is shown an enormous amount of text and given one relentless task: predict the next token. To get good at that, it has to implicitly learn grammar, facts, reasoning patterns, a little code, several languages — because all of those help it guess what comes next. The output is a base model: deeply knowledgeable, but not yet an assistant. It only knows how to continue text, so a question might just produce more questions.
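To make "predict the next token" concrete, here is a minimal sketch of how one token sequence becomes many training examples, and how a single prediction is scored. The helper names, the token ids, and the vocabulary size of 10,000 are all hypothetical, chosen only for illustration.

```python
import math

def next_token_examples(token_ids):
    """Pair every prefix of the sequence with the token that actually came next."""
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]

def cross_entropy(probs, target):
    """Loss for one prediction: negative log-probability of the true next token."""
    return -math.log(probs[target])

token_ids = [5, 2, 7, 2, 9]               # made-up ids for illustration
examples = next_token_examples(token_ids)
for prefix, target in examples:
    print(prefix, "->", target)

vocab_size = 10_000                       # hypothetical vocabulary size
uniform = [1.0 / vocab_size] * vocab_size # an untrained model guessing blindly
baseline = cross_entropy(uniform, target=7)
```

A five-token sequence yields four examples, one per position. The uniform guess scores ln(10,000) ≈ 9.21, which is the kind of loss a model starts from before it has learned anything; every bit of grammar or fact it absorbs shows up as probability moved onto the right token and a lower loss.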

2 · Supervised fine-tuning: learning to be helpful

Next, humans write a comparatively small set of high-quality prompt → ideal-answer examples, and the model is trained to imitate them. This is where it learns the format of being useful: follow the instruction, stay on topic, answer directly. The result is an instruct model — the first version that feels like it's actually responding to you.
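One common way to train on prompt → ideal-answer examples, assumed here since the text above does not specify the mechanics, is to keep the same next-token objective but count the loss only on the answer tokens, so the model learns to produce the answer rather than to reproduce the prompt. The sketch below works on per-token log-probabilities directly; the function name `sft_loss` and the numbers are hypothetical.

```python
import math

def sft_loss(token_logprobs, prompt_len):
    """Average negative log-likelihood over answer tokens only.

    token_logprobs[t] is the model's log-probability of the token at
    position t given everything before it. Positions below prompt_len
    belong to the prompt and are masked out of the loss.
    """
    answer = token_logprobs[prompt_len:]
    if not answer:
        raise ValueError("example has no answer tokens")
    return -sum(answer) / len(answer)

# Hypothetical per-token probabilities for one prompt + answer sequence:
# three prompt tokens followed by three answer tokens.
logprobs = [math.log(p) for p in (0.2, 0.1, 0.3, 0.9, 0.8, 0.5)]
loss = sft_loss(logprobs, prompt_len=3)

# Changing how well the model predicts the prompt does not change the loss.
other = [math.log(p) for p in (0.7, 0.7, 0.7, 0.9, 0.8, 0.5)]
same_loss = sft_loss(other, prompt_len=3)
```

Whether a given training setup masks the prompt this way is a design choice; some train on the full sequence instead.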

3 · Preference tuning: learning taste

Finally, the model's answers are refined against human preferences. People rank competing answers, and the model is nudged toward the ones humans prefer — classically with a reward model and reinforcement learning (RLHF), or more directly with methods like DPO. This stage shapes the harder-to-specify qualities: helpfulness, honesty, harmlessness, tone, and knowing when to decline.
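As a rough illustration of the "more direct" route, the DPO loss for a single ranked pair can be written in a few lines. It takes the summed log-probability of the preferred (chosen) and dispreferred (rejected) answers under the model being trained and under a frozen reference copy, and rewards the model for raising the chosen answer relative to the rejected one. The numbers and `beta=0.1` are illustrative assumptions, not values from this article.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under the
    model being trained (policy) or the frozen starting model (ref).
    """
    chosen_gain = policy_chosen - ref_chosen        # how much the policy raised the chosen answer
    rejected_gain = policy_rejected - ref_rejected  # how much it raised the rejected answer
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))

no_change = dpo_loss(-10.0, -10.0, -10.0, -10.0)    # policy identical to reference
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)     # policy now favours the chosen answer
worse = dpo_loss(-15.0, -12.0, -14.0, -13.0)        # policy now favours the rejected answer
```

When the policy matches the reference the loss is ln 2; it drops as the model shifts probability toward the answers people preferred and rises when it shifts the other way. The classic RLHF route reaches a similar goal by first fitting a separate reward model on the rankings and then optimising against it with reinforcement learning.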

Why this matters

Most of what you experience as a model's “personality” comes from stages 2 and 3, while most of what it knows comes from stage 1. That split explains a lot: fine-tuning can make a model friendlier or safer without teaching it new facts, and a model can confidently state something wrong because next-token prediction rewards fluent, plausible text — not necessarily true text.


© 2026 Rishabh Mehan · All rights reserved · Built with Next.js and a little stubbornness.