Hallucination
Confident outputs that are unsupported, contradicted by sources, or simply fabricated—often a symptom of missing grounding, not mere “randomness.”
Why fluency masks factuality
Language models optimize for plausible continuation. When external constraints are absent, the model still produces high-likelihood text—citations, case law, statistics—that “sounds right” because it matches genre patterns learned during pretraining. That is why errors can look authoritative: the objective was never truth in the philosophical sense. For the statistical picture of meaning as geometry, see understanding as geometry.
Grounding changes the game—if engineered
Grounding ties generations to verifiable evidence: databases, retrieved passages, tool outputs. It does not eliminate errors; it shifts them to retrieval misses, ranking mistakes, or conflicts the model fails to surface. Our longer piece on RAG walks through chunking, hybrid search, and explicit “not found” behaviors. Also read Grounding for the stack-level view.
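The explicit “not found” behavior mentioned above can be sketched as a small gate on retrieval scores. Everything here (the Passage shape, the score threshold, the top-k cutoff) is an illustrative assumption, not any particular library's API:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # retriever relevance score; higher is better

def retrieve_or_abstain(passages, min_score=0.35, top_k=3):
    """Return top passages for grounding, or None to signal 'not found'.

    Returning None lets the caller prompt the model to abstain instead
    of generating an unsupported answer.
    """
    ranked = sorted(passages, key=lambda p: p.score, reverse=True)[:top_k]
    if not ranked or ranked[0].score < min_score:
        return None  # explicit "not found" path
    return ranked

# A confident hit passes through; a weak match triggers abstention.
hits = retrieve_or_abstain([Passage("a", "strong match", 0.9),
                            Passage("b", "weak match", 0.2)])
miss = retrieve_or_abstain([Passage("c", "irrelevant", 0.1)])
```

The key design choice is that abstention is a first-class return value, not an error, so the generation layer has to handle it deliberately.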
Evaluation: averages lie in the tails
Aggregate accuracy on QA sets can hide catastrophic failures on rare but high-stakes queries. Teams serious about reliability pair open-domain tests with adversarial probes (see red-teaming beyond rude questions) and domain-specific spot checks. Benchmark culture often rewards short, clean prompts; real users write messy ones, a gap benchmarks as proxies discusses.
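The point about averages can be made concrete in a few lines of slicing. The data below is synthetic and the slice names are invented for illustration; the shape of the problem is the real content:

```python
from collections import defaultdict

def slice_accuracy(records):
    """records: iterable of (slice_name, correct: bool) pairs."""
    totals, rights = defaultdict(int), defaultdict(int)
    for name, ok in records:
        totals[name] += 1
        rights[name] += ok
    return {name: rights[name] / totals[name] for name in totals}

# 95 easy queries answered correctly, 5 rare high-stakes queries all wrong:
records = [("common", True)] * 95 + [("rare_high_stakes", False)] * 5

overall = sum(ok for _, ok in records) / len(records)  # headline: 0.95
per_slice = slice_accuracy(records)                    # rare slice: 0.0
```

A 95% headline number coexists with 0% accuracy on exactly the queries that matter most, which is why per-slice reporting belongs in any reliability dashboard.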
Interface design matters as much as weights
When UIs present AI text with the same visual authority as verified data, users over-trust—especially under time pressure. Friction, provenance, and editable intermediate steps are safety features. We explore this in automation bias and interface confidence.
Attribution and citation fidelity
In grounded systems, a distinct failure is “right conclusion, wrong source”: the model cites document A while the supporting span lives in B. Scoring citation overlap between answers and retrieved passages catches this better than ROUGE alone; human rubrics still help for nuanced domains, as discussed under grounding.
Confidence scores users can trust
Well-calibrated probabilities are rare in generative models, and showing raw logits confuses users. Some products instead expose retrieval scores, explicit abstention, or “two-step” flows where the user picks among retrieved passages before generation, which reduces blind trust; this aligns with the concerns in automation bias.
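One way to avoid surfacing raw logits is to map the top retrieval score to a coarse label with an explicit abstain path. The thresholds and label strings below are illustrative assumptions, not calibrated values:

```python
def confidence_label(top_retrieval_score):
    """Map a retrieval score to a user-facing label instead of a raw number.

    None means retrieval returned nothing usable.
    """
    if top_retrieval_score is None or top_retrieval_score < 0.3:
        return "no answer found"  # explicit abstention, shown as-is
    if top_retrieval_score < 0.6:
        return "low confidence: please review the sources"
    return "supported by retrieved sources"

confidence_label(0.9)   # "supported by retrieved sources"
confidence_label(None)  # "no answer found"
```

Coarse labels trade precision for honesty: users cannot over-read a three-bucket scale the way they over-read a number like 87.3%.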