Self-attention: pairwise relevance at scale
Attention weights route information between tokens—powerful and expensive; not a metaphor for human focus.
Mechanics
For each position, the model computes queries, keys, and values; softmaxed scores determine how much to blend other positions’ values. Stacked layers build increasingly global representations. This replaces many recurrent pathways with parallelizable ops—key to large-scale training.
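The blend step can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not a production implementation: it assumes queries, keys, and values are already projected, and it omits masking, multiple heads, and batching.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Blend value vectors by softmaxed query-key scores.

    Q, K, V: arrays of shape (seq_len, d), one row per position.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (seq, seq) pairwise relevance
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # each output mixes all values

# toy check: 4 tokens, 8-dim vectors (self-attention: Q = K = V)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
```

Note that every output row touches every input row; stacking such layers is what makes the representations increasingly global.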
Cost and context
Naive attention scales quadratically with sequence length—linking directly to token counts and context window budgets. Long-context models rely on kernels, flash attention, block-sparse patterns, or architectural approximations—each with trade-offs for latency and quantization paths.
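The quadratic term is easy to see in the score matrix alone. A back-of-the-envelope sketch (head count and element size are illustrative assumptions, not a specific model's numbers):

```python
def attention_score_bytes(seq_len, n_heads=32, bytes_per_el=2):
    # naive attention materializes one (seq_len x seq_len) score matrix
    # per head; fp16 elements assumed here
    return n_heads * seq_len * seq_len * bytes_per_el

# doubling the context quadruples the score-matrix footprint
a = attention_score_bytes(4096)   # 1 GiB at these assumptions
b = attention_score_bytes(8192)   # 4 GiB
```

This is the memory pressure that IO-aware kernels and subquadratic approximations are designed to avoid.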
Geometry and meaning
Attention patterns interact with the representation space described in understanding as geometry—steering which neighborhoods influence the next-token prediction.
Multimodal variants
Cross-attention between modalities appears in many multimodal stacks—new seams, new failure modes.
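A sketch of the seam itself: queries come from one modality, keys and values from another. Identity projections are assumed for brevity; real stacks insert learned projections on both sides.

```python
import numpy as np

def cross_attention(text_states, image_states):
    """Text queries attend over image keys/values.

    Projections omitted for brevity; shapes are (n_text, d) and (n_image, d).
    """
    d = text_states.shape[-1]
    scores = text_states @ image_states.T / np.sqrt(d)  # (n_text, n_image)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ image_states  # each text position blends image features

rng = np.random.default_rng(1)
text = rng.normal(size=(5, 16))   # 5 text tokens
image = rng.normal(size=(9, 16))  # 9 image patches
fused = cross_attention(text, image)
```

The output keeps the text-side sequence length but carries image-side content, which is exactly where new failure modes (misalignment, modality collapse) tend to surface.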
Flash attention and IO-aware implementations
Reordering the attention computation to reduce HBM traffic often accounts for most of the wall-clock win on long sequences—the "algorithm" and the "kernel" are co-designed; see also latency.
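The key algorithmic idea—an online softmax over blocks of keys, so the full score matrix is never materialized—can be shown in NumPy. This is a sketch of the recurrence in the spirit of FlashAttention, not the IO-aware kernel itself (the real win comes from keeping tiles in SRAM):

```python
import numpy as np

def tiled_attention(Q, K, V, block=2):
    """Blockwise attention with a running (online) softmax.

    Scores are computed one key-block at a time; running max `m` and
    denominator `l` let earlier partial sums be rescaled exactly.
    """
    n, d = Q.shape
    out = np.zeros_like(V)
    m = np.full(n, -np.inf)          # running row-max of scores
    l = np.zeros(n)                  # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)    # scores against this block only
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)    # rescale previously accumulated terms
        p = np.exp(s - m_new[:, None])
        out = out * scale[:, None] + p @ Vb
        l = l * scale + p.sum(axis=-1)
        m = m_new
    return out / l[:, None]

# agrees with the naive quadratic computation
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
s = Q @ K.T / np.sqrt(4)
w = np.exp(s - s.max(axis=-1, keepdims=True))
naive = (w / w.sum(axis=-1, keepdims=True)) @ V
tiled = tiled_attention(Q, K, V, block=2)
```

Because the rescaling is exact, this is a different evaluation order for the same function—unlike the subquadratic approximations below, which change the function itself.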
Linear attention and subquadratic approximations
State-space models, Performer-style kernels, and other subquadratic layers trade exact pairwise attention for structured inductive biases—useful when context must grow very large on fixed hardware budgets.
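One family of subquadratic layers replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) for a positive feature map φ, so keys and values are summarized once in a small matrix. A sketch assuming the simple elu(x)+1 feature map (one common choice; Performer uses random features instead):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention: phi(Q) @ (phi(K).T @ V), row-normalized.

    Cost is O(n * d^2) in sequence length n, versus O(n^2 * d) for
    exact pairwise attention. phi = elu(x) + 1 keeps features positive
    so the normalizer is well-behaved (an assumption of this sketch).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # (d, d) summary of all keys/values
    Z = Qf @ Kf.sum(axis=0)       # per-row normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(3)
x = rng.normal(size=(7, 4))
y = linear_attention(x, x, x)
```

The structured summary is the inductive bias: all pairwise detail is squeezed through a d-by-d matrix, which is what makes very long contexts affordable on fixed hardware budgets.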