Pretraining vs fine-tuning
Two stages, two objectives: broad statistical modeling first; task-shaped behavior second—often including preferences and safety policies.
What pretraining optimizes
Pretraining pushes a model to compress patterns in large text corpora: syntax, genre conventions, factual associations (with no guarantee of truth), and shallow reasoning regularities that emerge from next-token prediction. The result is a powerful generative prior—useful, but not inherently aligned with instructions, refusals, or citation discipline. Scale shifts which skills emerge first; it does not automatically inject judgment. For how scaling interacts with marginal returns, see compute curves and scaling laws.
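The objective itself is simple to state: minimize the average negative log-likelihood of each token given its context. A minimal sketch, substituting a toy bigram counter for a neural network and a hypothetical nine-word corpus for web-scale text:

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; any tokenized text works the same way.
corpus = "the cat sat on the mat the cat ran".split()

# "Pretraining" here is just counting bigrams: the maximum-likelihood
# estimate of P(next | current) for this tiny corpus.
counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

VOCAB = len(set(corpus))

def next_token_prob(cur, nxt, alpha=1.0):
    # Add-alpha smoothing so unseen continuations keep nonzero probability.
    c = counts[cur]
    return (c[nxt] + alpha) / (sum(c.values()) + alpha * VOCAB)

def avg_nll(tokens):
    # The pretraining objective: average negative log-likelihood of each
    # token given its context (here, just the previous token).
    losses = [-math.log(next_token_prob(c, n)) for c, n in zip(tokens, tokens[1:])]
    return sum(losses) / len(losses)
```

Real pretraining swaps the counter for a transformer and the corpus for trillions of tokens, but the loss has the same shape, which is why the model learns whatever regularities reduce it, true or not.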
What fine-tuning changes
Instruction tuning, supervised fine-tuning (SFT), and preference optimization reshape behavior: tone, format, tool use, and refusal boundaries. Reinforcement learning from human feedback (RLHF) is one popular family, but not the only one. The key conceptual point: fine-tuning aligns behavior in the situations humans rated; it does not install a complete moral theory. For nuance on what RLHF does and does not promise, read RLHF: alignment of behavior, not beliefs.
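As one concrete member of the preference-optimization family, Direct Preference Optimization (DPO) turns pairwise human ratings into a loss directly, without a separate reward model. A minimal sketch, assuming per-response log-probabilities have already been summed over tokens:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are whole-response log-probs under the policy (pi_*) and a
    # frozen reference model (ref_*, typically the SFT checkpoint).
    # The loss pushes the policy to prefer the human-chosen response,
    # measured relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

Note what the loss sees: only the pairs humans actually rated. Behavior outside that distribution is shaped indirectly at best, which is the sense in which alignment here is behavioral, not doctrinal.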
Fine-tuning can also specialize models to domains—medicine, law, coding—where the base model’s priors are too generic. That specialization can improve metrics while leaving rare failure modes intact; evaluation must follow the domain, not the leaderboard.
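A hypothetical illustration of why evaluation must follow the domain: slicing a held-out eval by domain can expose a failure mode the aggregate score hides. All names and numbers below are invented:

```python
from collections import defaultdict

def per_domain_accuracy(results):
    # results: (domain, correct) pairs from a held-out eval set.
    by_domain = defaultdict(list)
    for domain, correct in results:
        by_domain[domain].append(1 if correct else 0)
    return {d: sum(v) / len(v) for d, v in by_domain.items()}

# Invented eval: aggregate accuracy is 0.75, but the "dosage" slice the
# deployment actually depends on sits at 0.25.
results = ([("general", True)] * 8
           + [("dosage", True)] + [("dosage", False)] * 3)
```

A leaderboard reports the 0.75; the clinician experiences the 0.25.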
Where harms can originate
Toxic or biased generations can reflect pretraining data, under-specified fine-tuning objectives, or mismatches between deployment contexts and rating sets. Privacy risks can arise when fine-tuning memorizes user content—see fine-tuning on conversations. Synthetic data loops, used to augment training, can amplify systematic errors unless carefully filtered—discussed in synthetic data: amplifier or echo chamber.
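A cheap screen for the memorization risk above is to check generations for verbatim n-token overlap with the fine-tuning corpus. A sketch, with the window size `n` an arbitrary choice rather than a standard threshold:

```python
def reproduces_training_span(generation, training_texts, n=8):
    # Memorization screen: flag generations containing any verbatim
    # n-token span from the fine-tuning corpus.
    train_ngrams = set()
    for text in training_texts:
        toks = text.split()
        train_ngrams.update(tuple(toks[i:i + n])
                            for i in range(len(toks) - n + 1))
    gen = generation.split()
    return any(tuple(gen[i:i + n]) in train_ngrams
               for i in range(len(gen) - n + 1))
```

This catches only exact regurgitation; paraphrased leakage needs stronger tools, such as canary strings planted in the training set before tuning.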
How to read product claims
When a vendor says “we aligned the model,” ask which stage changed, with what data, and against what adversarial evaluation—not only whether the chatbot sounds polite. Politeness and correctness are different axes; benchmarks often measure only one. Our essay on benchmarks as proxies explains why leaderboard scores diverge from messy workflows.
Continued pretraining and domain adaptation
Teams often insert a “continued pretraining” phase on curated corpora before instruction tuning—useful when generic web text under-represents a domain (medicine, law, codebases). The risk is catastrophic forgetting if the learning rate is too high or the data mix drifts too far from the original distribution; evaluation on both general and specialist tasks is mandatory.
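One common mitigation is controlling the data mix directly: replay general text alongside the curated domain corpus so the model keeps seeing its original distribution. A sketch, where `domain_frac` is a hypothetical knob to tune against both general and specialist evals:

```python
import itertools
import random

def mixed_batch(domain_stream, general_stream, domain_frac=0.3, size=8, seed=0):
    # Interleave curated domain text with replayed general data; seeding
    # keeps the mix reproducible across runs.
    rng = random.Random(seed)
    return [next(domain_stream) if rng.random() < domain_frac
            else next(general_stream)
            for _ in range(size)]

# Invented stream names for illustration.
batch = mixed_batch(itertools.cycle(["pubmed"]), itertools.cycle(["web"]))
```

In practice the streams are shard readers and the ratio is swept, but the principle is the same: forgetting is managed at the data loader, then verified on both eval suites.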
Parameter-efficient fine-tuning
Adapters, LoRA, and prefix tuning update small parameter subsets, lowering cost and making per-tenant customization feasible. They also change failure modes: multiple LoRAs composed poorly can interfere, and routing and capacity planning become product concerns, not notebook details. This overlaps with open-weights debates about who may ship which adapter on whose base model.
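The LoRA idea itself fits in a few lines: the frozen base weight is left untouched, and a scaled low-rank product is added on top. A dependency-free sketch (real implementations operate on framework tensors, not nested lists):

```python
def matmul(X, Y):
    # Minimal dense matmul so the sketch stays dependency-free.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, alpha=16):
    # Frozen base projection x @ W plus a scaled low-rank update
    # (alpha / r) * x @ A @ B; r is the shared inner dimension (rank)
    # of A and B. B is initialized to zero, so training starts from
    # exactly the base model's behavior.
    r = len(B)
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]
```

Because only A and B are trained, each tenant's customization is a small file; the interference and routing questions above arise when several such deltas are stacked onto one shared W.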