Representation

Why “understanding” in models is a geometry problem

Meaning as neighborhood structure in high dimensions—and why paraphrase, prompt drift, and retrieval each interact with that geometry differently.

From rows in a table to clouds of vectors

Large language models do not store facts as discrete records. They learn distributed representations: tokens and contexts map to points in a space where co-occurrence statistics pull related uses together. That is why synonyms cluster, analogies align along directions, and small prompt edits can slide completions across “valleys” of behavior you did not name explicitly.
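The clustering and analogy-direction claims can be made concrete with a toy example. This is a minimal sketch using hand-constructed 4-dimensional vectors (not real model embeddings); the words and dimension meanings are invented for illustration:

```python
import numpy as np

# Toy "embeddings", hand-constructed for illustration, not from a real model.
# Dimensions loosely encode: [royalty, gender, fruitness, size].
vecs = {
    "king":  np.array([0.9,  0.8, 0.0, 0.6]),
    "queen": np.array([0.9, -0.8, 0.0, 0.5]),
    "man":   np.array([0.1,  0.8, 0.0, 0.6]),
    "woman": np.array([0.1, -0.8, 0.0, 0.5]),
    "apple": np.array([0.0,  0.0, 0.9, 0.2]),
}

def cos(a, b):
    # Cosine similarity: angle-based "nearness" in embedding space.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Related words sit closer together than unrelated ones...
assert cos(vecs["king"], vecs["queen"]) > cos(vecs["king"], vecs["apple"])

# ...and the analogy king - man + woman lands nearest to queen:
target = vecs["king"] - vecs["man"] + vecs["woman"]
nearest = max(vecs, key=lambda w: cos(target, vecs[w]))
print(nearest)  # → queen
```

Real embeddings have hundreds or thousands of dimensions and far noisier structure, but the same vector arithmetic is how analogy directions are typically demonstrated.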

Interpretability research probes these structures—sparse features, linear subspaces, causal interventions—not to claim human-like cognition, but to anticipate when benign edits become harmful or when capabilities appear suddenly along narrow directions. The practical upshot: robustness is often about the shape of data around the boundary, not only about aggregate loss.

Where retrieval re-anchors the geometry

Parametric memory alone cannot certify facts against the world. Retrieval-augmented generation injects external points—document chunks whose embeddings lie near a query—into the context window. The model still interpolates language, but now around anchored text. Failures often look like near-miss retrieval: chunks that are topically close but factually off, which the model paraphrases with confidence because fluency and correctness remain decoupled—see hallucination.
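The retrieval step itself is a nearest-neighbor lookup in embedding space. A minimal sketch, assuming a stand-in `embed()` built from a tiny bag-of-words vocabulary (a real system would use a learned embedding model); the chunks and vocabulary are invented:

```python
import numpy as np

# Hypothetical stand-in embedding: normalized bag-of-words counts over a
# tiny fixed vocabulary. Real RAG systems use learned dense embeddings.
VOCAB = ["capital", "france", "paris", "texas", "population"]

def embed(text):
    words = text.lower().split()
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

chunks = [
    "paris is the capital of france",
    "paris texas has a population near 25000",
]
chunk_vecs = [embed(c) for c in chunks]

query = "what is the capital of france"
scores = [float(embed(query) @ v) for v in chunk_vecs]
best = chunks[int(np.argmax(scores))]
print(best)
```

Here the France chunk wins cleanly, but the Paris, Texas chunk is the shape of a near miss: under a weaker embedding or a vaguer query it can score highest, and the model will then fluently anchor on the wrong text.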

Training stages reshape the same space

Fine-tuning stacks a behavioral policy on top of the broad prior learned in pretraining. Reinforcement-style tuning (discussed in RLHF) can change what the model says without erasing what it could say under different prompts or decoding settings—another face of geometry: multiple modes coexist in the distribution.
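The "multiple modes coexist" point is easy to see at the decoding stage. A sketch with made-up logits for three hypothetical completions, showing how sampling temperature re-weights modes without removing any of them:

```python
import numpy as np

# Made-up next-token logits for three hypothetical completion modes:
# index 0 = tuned "preferred" response, 1 = direct answer, 2 = off-topic.
logits = np.array([3.0, 2.5, 0.5])

def probs(logits, temperature):
    # Temperature-scaled softmax (shifted by the max for numerical stability).
    z = logits / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

for t in (0.2, 1.0, 3.0):
    print(t, np.round(probs(logits, t), 3))
```

At low temperature the top mode dominates almost completely; at high temperature the suppressed modes regain substantial probability. Tuning that shifts logits changes which mode you usually see, not which modes exist.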

What to measure in products

Leaderboards rarely capture prompt sensitivity. Pair internal evaluations with adversarial paraphrase suites and domain-shifted inputs. For why benchmarks mislead, read benchmarks as proxies. If multimodal inputs matter, stitching issues in multimodal fusion can distort the effective “neighborhood” the model sees.
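A paraphrase suite can be as simple as asking the same question several ways and measuring answer agreement. A sketch in which `model_answer` is a hypothetical stand-in for a call to your deployed model (here deliberately brittle to wording, to show what the metric catches):

```python
# Hypothetical stand-in for a deployed model call; brittle on purpose so the
# consistency metric has something to detect.
def model_answer(prompt):
    return "Paris" if "capital" in prompt.lower() else "unknown"

# A small adversarial paraphrase suite for one underlying question.
paraphrases = [
    "What is the capital of France?",
    "France's capital city is which one?",
    "Which city serves as the seat of government of France?",
]

answers = [model_answer(p) for p in paraphrases]
# Consistency: fraction of answers matching the modal answer.
modal = max(set(answers), key=answers.count)
consistency = answers.count(modal) / len(answers)
print(answers, round(consistency, 2))
```

A consistency well below 1.0 on semantically equivalent prompts is exactly the prompt sensitivity that aggregate leaderboard scores hide.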

Linear probes and interpretability limits

Linear probes can predict supervised labels from activations—evidence that task-relevant information is linearly accessible—but "linearly decodable" does not mean the model "knows" the label in human terms. Causal scrubbing and path patching add stronger tests of what components actually use those directions.
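A minimal linear-probe sketch on synthetic "activations": the label is planted along one dimension, so a plain logistic-regression probe decodes it easily. All data here is synthetic; a real probe would train on recorded activations from a model's residual stream:

```python
import numpy as np

# Synthetic stand-in for recorded activations; the label is linearly
# decodable along dimension 3 by construction.
rng = np.random.default_rng(0)
n, d = 400, 16
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[3] = 2.0
y = (X @ w_true + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Logistic-regression probe trained with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

acc = float(((X @ w > 0).astype(float) == y).mean())
print(round(acc, 2))
```

High probe accuracy shows only that the direction exists and is readable; it says nothing about whether downstream components causally use it, which is what scrubbing and patching interrogate.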

Robustness beyond average-case loss

Adversarial training and contrastive data mixes reshape decision boundaries around fragile regions of input space. When failures concentrate on identifiable slices, this is often more cost-effective than blind scale-up; see also red-teaming and scaling trade-offs.
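One concrete version of "reshape boundaries around fragile regions" is re-weighting the training mix toward slices where evaluation shows concentrated failures. A sketch assuming per-slice error rates are already measured; the slice names and numbers are invented:

```python
import random

# Hypothetical per-slice eval error rates (invented numbers): one slice
# concentrates most of the observed failures.
slice_error = {"general": 0.04, "legal_paraphrase": 0.31, "code": 0.07}

# Sample training slices proportionally to error, so fragile regions get
# more coverage than a uniform mix would give them.
total = sum(slice_error.values())
weights = {s: e / total for s, e in slice_error.items()}

random.seed(0)
slices = list(weights)
batch = random.choices(slices, weights=[weights[s] for s in slices], k=1000)
print({s: batch.count(s) for s in slices})
```

In practice you would cap the upweighting to avoid forgetting on the well-behaved slices, but the principle stands: target the geometry of the failures, not the average loss.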