Lexicon

Context window

The maximum contiguous span of tokens a model can attend to in one pass—an engineering ceiling with cognitive implications, not “unlimited memory.”

Mechanics in one breath

In standard transformer blocks, each token can attend to every other token within the window. That pairwise flexibility enables long-range dependencies, but it also implies growing compute and memory as sequences lengthen—often quadratically for full attention unless mitigated by sparse patterns, linear approximations, or hardware-specific kernels. For the underlying operation, read self-attention at scale.
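The quadratic term is easiest to see in code. A minimal single-head self-attention sketch (illustrative only; random matrices stand in for learned weights, and nothing here is an optimized kernel):

```python
import numpy as np

def self_attention(x, rng=None):
    """x: (n, d) token embeddings -> (n, d) contextualized outputs."""
    n, d = x.shape
    rng = rng or np.random.default_rng(0)
    # Hypothetical random projections stand in for learned Q/K/V weights.
    wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    # The (n, n) score matrix is where quadratic cost lives:
    # doubling the sequence length quadruples its size.
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v  # each output is a weighted mix of all values

out = self_attention(np.random.default_rng(1).standard_normal((8, 16)))
```

Every token's output mixes information from all eight positions, which is exactly the pairwise flexibility the paragraph describes, paid for by that n-by-n matrix.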

What longer windows change for users

A wider window lets you paste more code, transcripts, or policy text into a single prompt—reducing manual chunking. It does not guarantee faithful recall: models can still drift, summarize incorrectly, or attend unevenly across positions (“lost in the middle” phenomena reported in several studies). Retrieval systems remain relevant because relevance—not raw length—often limits quality. Compare with RAG as a contract between memory and honesty.
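Before deciding to chunk at all, it helps to estimate whether a document fits. A planning sketch, assuming a crude 4-characters-per-token heuristic for English text (a real tokenizer would be more accurate):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    return max(1, len(text) // 4)

def fits_window(text: str, window: int, reserve: int = 1024) -> bool:
    """Reserve headroom for instructions and the model's reply."""
    return approx_tokens(text) + reserve <= window

small_doc_fits = fits_window("word " * 2000, window=8192)   # ~2,500 tokens
huge_doc_fits = fits_window("x" * 100_000, window=8192)     # ~25,000 tokens
```

The `reserve` parameter is the point: even a document that "fits" leaves no room for the answer if it consumes the whole window.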

Error chains and noise amplification

Long contexts compound small mistakes: a mis-parsed table row or an ambiguous pronoun can steer later tokens confidently off course. When user-provided documents are noisy (OCR glitches, HTML boilerplate), “fitting the whole file” can mean fitting the wrong emphasis. Mitigations include cleaning pipelines, structured excerpts, and explicit instructions to quote sparingly—tied to grounding and hallucination dynamics.
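The cleaning step can be modest and still pay off. A minimal sketch that strips HTML boilerplate and collapses whitespace before text reaches the prompt (real pipelines would also handle OCR repair and table structure):

```python
import re

def clean(raw: str) -> str:
    # Drop script/style blocks entirely: their contents are pure noise.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", raw,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop any remaining tags, keeping the visible text between them.
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse runs of whitespace so token counts reflect content.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = clean("<div>Policy <b>text</b></div><script>track()</script>")
```

Regex-based tag stripping is a deliberate simplification; for production-grade HTML an actual parser is safer, but the principle stands: what you remove before prompting never gets mis-emphasized.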

When to stop stuffing context

If your task requires authoritative facts, external retrieval or tools often beat heroic context lengths. If your application has a latency-sensitive UX, remember that longer prompts increase prefill cost—see latency budgets and perceived intelligence.
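The prefill cost is easy to feel with a back-of-envelope sketch. The throughput number below is an illustrative assumption (tokens per second varies widely across models and hardware); what matters is the shape of the relationship:

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float = 5000.0) -> float:
    # Assumed prefill throughput; real values depend on model and hardware.
    return prompt_tokens / prefill_tps

# Context stuffing is paid for up front, before the first output token:
estimates = {n: prefill_seconds(n) for n in (1_000, 32_000, 128_000)}
```

A 128k-token prompt at this assumed rate waits roughly 25 seconds before generation even begins, which is why "just paste everything" has a UX price.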

Soft attention and positional biases

Even inside the window, models may not attend uniformly—“lost in the middle” and positional biases mean critical facts buried in the center of a long prompt can be under-used. Mitigations include repetition, structured summaries, and moving key constraints near the start or end, guided by empirical probing—paired with retrieval design.
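The placement mitigation can be made mechanical. A sketch of a prompt builder that puts critical constraints at both the start and the end, with bulk context in the middle (the layout and field labels are illustrative, not a standard template):

```python
def build_prompt(constraints: list[str], context: str) -> str:
    """Place constraints at the edges, where attention is least likely to miss them."""
    rules = "\n".join(f"- {c}" for c in constraints)
    return (
        f"Key constraints:\n{rules}\n\n"
        f"Reference material:\n{context}\n\n"
        f"Reminder of the constraints above:\n{rules}"
    )

prompt = build_prompt(
    ["Quote sources verbatim", "Answer in French"],
    "…long pasted document…",
)
```

Repeating the rules costs a few tokens; losing them mid-prompt costs the answer.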

Sliding windows and hierarchical summaries

When documents exceed the window, sliding windows with overlap, map-reduce summarization, and recursive condensation each trade recall for compute. Every pattern introduces new failure modes (inconsistent summaries across chunks), so integration tests should span chunk boundaries—see benchmarks for why naive exact-match tests miss this.
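The first of those patterns can be sketched in a few lines. Overlapping chunks ensure a fact near a boundary appears in two chunks rather than being split across one; whitespace words stand in for real tokenization here:

```python
def sliding_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size chunks that advance by (size - overlap) each step."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # last chunk already reaches the end of the document
    return chunks

chunks = sliding_chunks([f"w{i}" for i in range(10)], size=4, overlap=2)
```

Notice the testing implication from the paragraph above: a fact at position w3 lands in both the first and second chunks, so any downstream summarizer must handle seeing it twice—exactly the cross-boundary behavior naive exact-match tests never exercise.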