Self-attention: pairwise relevance at scale
Attention weights route information between tokens—powerful and expensive; not a metaphor for human focus.
Mechanics
For each position, the model computes queries, keys, and values; softmaxed scores determine how much to blend other positions’ values. Stacked layers build increasingly global representations. This replaces many recurrent pathways with parallelizable ops—key to large-scale training.
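The blend step can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not a production implementation: it assumes queries, keys, and values are already projected, and it omits masking, multiple heads, and batching.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Blend value vectors by softmaxed query-key scores.

    Q, K, V: arrays of shape (seq_len, d), one row per position.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq, seq) pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # each output mixes all values

# toy check: 4 tokens, 8-dim vectors (self-attention: Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
```

Note that every output row touches every input row; stacking such layers is what makes the representations increasingly global.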
Cost and context
Naive attention scales quadratically with sequence length—linking directly to token counts and context window budgets. Long-context models rely on kernels, flash attention, block-sparse patterns, or architectural approximations—each with trade-offs for latency and quantization paths.
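The quadratic term is easy to see in the score matrix alone. A back-of-the-envelope sketch (head count and element size are illustrative assumptions, not a specific model's numbers):

```python
def attention_score_bytes(seq_len, n_heads=32, bytes_per_el=2):
    # naive attention materializes one (seq_len x seq_len) score matrix
    # per head; fp16 elements assumed here
    return n_heads * seq_len * seq_len * bytes_per_el

# doubling the context quadruples the score-matrix footprint
a = attention_score_bytes(4096)   # 1 GiB at these assumptions
b = attention_score_bytes(8192)   # 4 GiB
```

This is the memory pressure that IO-aware kernels and subquadratic approximations are designed to avoid.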
Geometry and meaning
Attention patterns interact with the representation space described in understanding as geometry—steering which neighborhoods influence the next-token prediction.
Multimodal variants
Cross-attention between modalities appears in many multimodal stacks—new seams, new failure modes.
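A sketch of the seam itself: queries come from one modality, keys and values from another. Identity projections are assumed for brevity; real stacks insert learned projections on both sides.

```python
import numpy as np

def cross_attention(text_states, image_states):
    """Text queries attend over image keys/values.

    Projections omitted for brevity; shapes are (n_text, d) and (n_image, d).
    """
    d = text_states.shape[-1]
    scores = text_states @ image_states.T / np.sqrt(d)  # (n_text, n_image)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ image_states  # each text position blends image features

rng = np.random.default_rng(1)
text = rng.normal(size=(5, 16))   # 5 text tokens
image = rng.normal(size=(9, 16))  # 9 image patches
fused = cross_attention(text, image)
```

The output keeps the text-side sequence length but carries image-side content, which is exactly where new failure modes (misalignment, modality collapse) tend to surface.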
Flash attention and IO-aware implementations
Reordering the attention computation to reduce HBM traffic often accounts for most of the wall-clock win on long sequences—the "algorithm" and the "kernel" are co-designed; see also latency.
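The key algorithmic idea—an online softmax over blocks of keys, so the full score matrix is never materialized—can be shown in NumPy. This is a sketch of the recurrence in the spirit of FlashAttention, not the IO-aware kernel itself (the real win comes from keeping tiles in SRAM):

```python
import numpy as np

def tiled_attention(Q, K, V, block=2):
    """Blockwise attention with a running (online) softmax.

    Scores are computed one key-block at a time; running max `m` and
    denominator `l` let earlier partial sums be rescaled exactly.
    """
    n, d = Q.shape
    out = np.zeros_like(V)
    m = np.full(n, -np.inf)          # running row-max of scores
    l = np.zeros(n)                  # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)    # scores against this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)    # rescale previously accumulated terms
        p = np.exp(s - m_new[:, None])
        out = out * scale[:, None] + p @ Vb
        l = l * scale + p.sum(axis=-1)
        m = m_new
    return out / l[:, None]

# agrees with the naive quadratic computation
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
s = Q @ K.T / np.sqrt(4)
w = np.exp(s - s.max(axis=-1, keepdims=True))
naive = (w / w.sum(axis=-1, keepdims=True)) @ V
tiled = tiled_attention(Q, K, V, block=2)
```

Because the rescaling is exact, this is a different evaluation order for the same function—unlike the subquadratic approximations below, which change the function itself.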
Linear attention and subquadratic approximations
State-space models, Performer-style kernels, and other subquadratic layers trade exact pairwise attention for structured inductive biases—useful when context must grow very large on fixed hardware budgets.
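One family of subquadratic layers replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) for a positive feature map φ, so keys and values are summarized once in a small matrix. A sketch assuming the simple elu(x)+1 feature map (one common choice; Performer uses random features instead):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention: phi(Q) @ (phi(K).T @ V), row-normalized.

    Cost is O(n * d^2) in sequence length n, versus O(n^2 * d) for
    exact pairwise attention. phi = elu(x) + 1 keeps features positive
    so the normalizer is well-behaved (an assumption of this sketch).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d) summary of all keys/values
    Z = Qf @ Kf.sum(axis=0)       # per-row normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(3)
x = rng.normal(size=(7, 4))
y = linear_attention(x, x, x)
```

The structured summary is the inductive bias: all pairwise detail is squeezed through a d-by-d matrix, which is what makes very long contexts affordable on fixed hardware budgets.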