Temperature
A control on randomness in token sampling—turn it down for steadiness, up for variety—always within the model’s learned distribution.
What temperature actually does
At each step, the model outputs logits that are converted to a probability distribution over the next token. Temperature rescales those logits before softmax (dividing each by T): values below 1 sharpen the distribution toward the top token (more deterministic); values above 1 flatten it (more exploration). It does not inject facts from nowhere; it reshapes which high-likelihood paths are reachable. That is why temperature interacts with hallucination: fluent falsehoods can remain high-probability under some prompts even at low temperature.
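The rescaling can be sketched in a few lines. This is a minimal illustration with made-up toy logits, not any particular model's decoding code:

```python
import math

def temperature_softmax(logits, temperature):
    """Turn raw logits into a probability distribution, rescaled by temperature.

    T < 1 sharpens the distribution (approaching argmax as T -> 0);
    T > 1 flattens it toward uniform; T = 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the T -> 0 limit")
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
print(temperature_softmax(logits, 0.5))  # sharper: mass concentrates on the top token
print(temperature_softmax(logits, 1.0))  # the model's learned distribution, as-is
print(temperature_softmax(logits, 2.0))  # flatter: low-likelihood tokens become reachable
```

Note that temperature only redistributes mass among tokens the model already scores; it never makes a zero-logit continuation appear from nowhere, which matches the point above.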
Creative work vs reliability
Brainstorming, fiction, and divergent ideation often benefit from higher temperature or alternative samplers (top-p, top-k). Customer support, medical triage copy, and legal drafting templates often need constrained, repeatable behavior, sometimes paired with retrieval; see grounding and RAG. The right knob depends on whether the failure mode you fear is "boring repetition" or "creative mistake."
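As a rough sketch of how one of those alternative samplers works, top-p (nucleus) sampling keeps only the smallest set of tokens whose cumulative probability reaches p, then renormalizes. The probabilities below are illustrative, not from any real model:

```python
def top_p_filter(probs, p):
    """Zero out tokens outside the nucleus (smallest set whose cumulative
    probability reaches p), then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= p:
            break
    filtered = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    total = sum(filtered)
    return [f / total for f in filtered]

# Illustrative distribution over four candidate tokens.
probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, p=0.75))  # keeps the two most likely tokens, renormalized
```

Unlike temperature, which reshapes the whole distribution smoothly, top-p hard-truncates the tail, so the two knobs fail differently and are often tuned together.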
Policies, refusals, and sampling
Alignment training (e.g., via RLHF—see our RLHF essay) changes which completions are probable in sensitive categories. Temperature alone cannot substitute for policy design; it only modulates exploration around whatever the policy already permits.
Tokens and cost reminders
Sampling choices affect not only diversity but also iteration patterns in agents that loop until a stop condition—impacting latency and spend. For token-level accounting, revisit Token and latency budgets.
Determinism for regression tests
Engineering teams often fix seeds and decoding settings in CI to catch regressions in prompts and tools. Even so, outputs can still vary bitwise across GPU drivers and batch sizes, so assert on semantic invariants (tool calls, JSON shape) rather than exact strings when possible, especially alongside agent loops.
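One way such an invariant check can look, as a minimal sketch: the "search" tool name, the key names, and the sample outputs below are all hypothetical placeholders, not a real harness:

```python
import json

def assert_semantic_invariants(raw_output):
    """Regression check that tolerates benign nondeterminism: validate the
    JSON shape and the tool call, not the exact string the model produced."""
    payload = json.loads(raw_output)  # must at least parse as JSON
    assert set(payload) >= {"tool", "arguments"}, "missing required keys"
    assert payload["tool"] == "search"            # hypothetical tool name
    assert isinstance(payload["arguments"], dict)  # shape, not exact content
    return payload

# Two runs differing in key order and wording both pass the same check.
run_a = '{"tool": "search", "arguments": {"query": "refund policy"}}'
run_b = '{"arguments": {"query": "refund policy today"}, "tool": "search"}'
assert_semantic_invariants(run_a)
assert_semantic_invariants(run_b)
```

A string-equality test would flag run_b as a regression even though the agent's behavior is unchanged; the invariant check only fails when the tool call or JSON shape actually breaks.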
Product defaults and user mental models
Exposing temperature as a slider without guidance leads users to treat it as "creativity" when it is really a variance dial—sometimes worsening factual tasks. Sensible defaults, presets ("precise" vs "exploratory"), and inline explanations reduce misuse; pair with hallucination education in help docs.