Training

RLHF is alignment of behavior, not beliefs

Reinforcement learning from human feedback shapes a policy toward rated preferences; it does not guarantee coherent beliefs or close factual gaps.

The basic loop

Humans compare model outputs; a reward model is trained to predict those preferences; policy optimization nudges generation toward higher predicted reward. Variants differ (PPO, DPO, IPO, and more), but the pattern is constant: optimize behavior under the raters’ distribution. That is why politeness can improve while factual accuracy stalls: raters may not label every failure mode exhaustively.
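
As a concrete sketch of the preference-optimization step, here is a minimal DPO-style loss in PyTorch. It assumes per-sequence log-probabilities have already been computed under the policy and a frozen reference model; the random tensors below are placeholders for those, not outputs of a real model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities
    (sum of token log-probs) under the policy or the frozen reference.
    """
    # Log-ratio of policy vs. reference for the preferred and rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # The rater's preference becomes a logistic target: push the chosen
    # response's log-ratio above the rejected one's.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()

# Toy batch of 4 preference pairs with placeholder log-probabilities.
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(f"DPO loss: {loss.item():.4f}")
```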

What sits underneath

RLHF does not rewrite the world model from scratch; it steers a base model already shaped by pretraining. Capabilities unlocked earlier can remain accessible under adversarial prompts or tool chains, topics picked up again in red-teaming and agentic workflows.

Distribution shift is the silent adversary

When deployments differ from rating environments—new languages, longer contexts, enterprise jargon—policies can fray. Evaluation must track tail risks, not only average helpfulness; see benchmarks as proxies.
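
One lightweight way to watch the tail rather than the average is to slice evaluation scores by deployment condition and report the worst slice next to the mean. The slice names and scores below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical eval records: (slice name, quality score in [0, 1]).
records = [
    ("english_short", 0.92), ("english_short", 0.88),
    ("german_long", 0.61), ("german_long", 0.58),
    ("enterprise_jargon", 0.34), ("enterprise_jargon", 0.41),
]

by_slice = defaultdict(list)
for slice_name, score in records:
    by_slice[slice_name].append(score)

slice_means = {name: mean(scores) for name, scores in by_slice.items()}
overall = mean(score for _, score in records)
worst_slice, worst_score = min(slice_means.items(), key=lambda kv: kv[1])

# The headline average hides the tail: report the worst slice alongside it.
print(f"average score: {overall:.2f}")
print(f"worst slice:   {worst_slice} ({worst_score:.2f})")
```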

Sampling and policy interact

Temperature and other decoding parameters explore different regions of the policy’s output distribution; “safety” is not independent of decoding choices. Grounding via RAG changes the evidence available at decision time, sometimes more than a marginal RLHF update.
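
A toy illustration of how much the decoder alone changes behavior: the same next-token logits, reshaped by different temperatures. The logits are made up; the point is that low temperature concentrates probability mass while high temperature flattens it.

```python
import numpy as np

def sample_distribution(logits, temperature):
    """Softmax over logits after temperature scaling."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Toy next-token logits from a single policy; only the decoder changes.
logits = [2.0, 1.0, 0.2, -1.0]
for t in (0.2, 0.7, 1.5):
    print(t, np.round(sample_distribution(logits, t), 3))
```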

Annotator disagreement and minority preferences

Raters disagree; majority voting bakes in dominant cultural defaults. Document disagreement rates, offer per-locale rating pools where the product requires it, and audit whose preferences get amplified, threads picked up again in persistent questions.
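
A minimal sketch of the bookkeeping this implies: compute the share of items where raters disagree, and check how often each locale’s raters end up on the majority side. The items, locales, and votes here are invented placeholders.

```python
from collections import Counter, defaultdict

# Hypothetical ratings: item id -> list of (locale, preferred response "A" or "B").
ratings = {
    "item-1": [("en-US", "A"), ("en-US", "A"), ("de-DE", "B")],
    "item-2": [("en-US", "B"), ("de-DE", "B"), ("de-DE", "B")],
    "item-3": [("en-US", "A"), ("en-US", "B"), ("de-DE", "B")],
}

disagreements = 0
majority_by_locale = defaultdict(Counter)
for item, votes in ratings.items():
    labels = [label for _, label in votes]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) > 1:
        disagreements += 1
    # Track how often each locale's raters side with the overall majority.
    for locale, label in votes:
        majority_by_locale[locale][label == majority] += 1

print(f"disagreement rate: {disagreements / len(ratings):.2f}")
for locale, counts in majority_by_locale.items():
    agree = counts[True] / (counts[True] + counts[False])
    print(f"{locale}: sides with the majority {agree:.0%} of the time")
```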

Offline RLHF and on-policy drift

Training on fresh model-generated rollouts (on-policy) tracks evolving behavior but risks instability; offline methods trained on static logs are safer but stale. Hybrid schedules and conservative updates are common mitigations; neither removes the need for continuous evaluation under realistic benchmarks.
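
A rough sketch of both mitigations, with made-up batches standing in for real data: a simple schedule that interleaves occasional fresh on-policy rollouts with offline batches, and a PPO-style clipped objective that keeps each update conservative relative to the policy that generated the data.

```python
import torch

def clipped_objective(logp_new, logp_old, advantages, clip=0.2):
    """PPO-style conservative update: clamp the probability ratio so one
    step cannot move the policy far from the data-generating policy."""
    ratio = torch.exp(logp_new - logp_old)
    return torch.min(ratio * advantages,
                     torch.clamp(ratio, 1 - clip, 1 + clip) * advantages).mean()

def batch_source(step, on_policy_every=4):
    """Toy hybrid schedule: one fresh on-policy batch every few offline batches."""
    return "on_policy" if step % on_policy_every == 0 else "offline"

torch.manual_seed(0)
for step in range(8):
    source = batch_source(step)
    # Placeholder log-probs and advantages stand in for a real batch.
    obj = clipped_objective(torch.randn(16), torch.randn(16), torch.randn(16))
    print(f"step {step}: {source:9s} batch, clipped objective {obj.item():+.3f}")
```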