Lexicon

Token

The atomic unit modern language models count, price, and attend over—usually a subword fragment, not a dictionary word.

Why “word” is the wrong mental model

Tokenizers split text into pieces that balance vocabulary size with coverage. Common words may map to a single token; rare or compound phrases may consume many. That is why two sentences with the same word count can differ radically in token length—and in API cost if your provider bills per token. When you read benchmarks or latency numbers, check whether they are reported in tokens or characters; the difference matters for non-Latin scripts and for languages with rich morphology.
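The word-count-versus-token-count gap is easy to see with a toy greedy longest-match subword tokenizer. This is a sketch, not any real tokenizer: real systems (BPE, WordPiece) learn their vocabularies from data, and the vocabulary below is invented for illustration. The counting behavior, though, is the same: three common words can cost three tokens while three rarer words cost ten.

```python
def tokenize(text, vocab):
    """Greedy longest-prefix subword split; unknown characters fall back to single-char tokens."""
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # take the longest vocabulary entry that prefixes the remainder
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])  # no match: emit one character
                i += 1
    return tokens

# Invented vocabulary for the sketch
vocab = {"the", "cat", "sat", "on", "mat", "anti", "dis", "establish",
         "ment", "arian", "ism"}

a = tokenize("the cat sat", vocab)                               # 3 words, 3 tokens
b = tokenize("antidisestablishmentarianism sat there", vocab)    # 3 words, 10 tokens
```

Both inputs are three words long, yet the second costs more than three times as many tokens, which is exactly the discrepancy that shows up in per-token billing.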

Tokens also define the interface between human-readable text and model math. Every layer—embedding lookup, attention, feed-forward—operates on token indices. If you want to connect this idea to architecture, see our essay on self-attention and how pairwise relationships scale with sequence length.
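The text-to-indices-to-vectors pipeline can be sketched in a few lines. Everything here is made up for illustration: the vocabulary, the embedding dimension, and the random table stand in for the learned parameters of a real model.

```python
import random

random.seed(0)

# Invented vocabulary with reserved indices, as in most pipelines
vocab = {"<pad>": 0, "<unk>": 1, "hello": 2, "world": 3}
dim = 4

# Stand-in for a learned embedding table: one row per vocabulary entry
embeddings = [[random.gauss(0, 0.02) for _ in range(dim)] for _ in vocab]

def encode(words):
    """Map words to token indices, falling back to <unk> for out-of-vocabulary items."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]

ids = encode("hello world out there".split())    # unknown words collapse to <unk>
vectors = [embeddings[i] for i in ids]           # what attention layers actually consume
```

Note that "out" and "there" both map to the same `<unk>` index and therefore the same vector: information lost at tokenization time is invisible to every layer downstream.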

Context windows are measured in tokens

A model’s context window caps how many tokens can be processed together in one forward pass. “Long context” is not infinite memory; it is a larger canvas on which attention (and memory cost) still applies. For a deeper look at trade-offs when inputs grow, read Context window and the note on retrieval-augmented generation, which often exists precisely because stuffing everything into context is impractical.
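A common, if lossy, fallback when input exceeds the window is to keep only the most recent tokens while reserving room for the model's output. The function below is a minimal sketch of that budgeting logic; the parameter names and the reserve default are assumptions, not any particular API.

```python
def fit_to_window(token_ids, max_tokens, reserve_for_output=64):
    """Truncate to the newest tokens that fit, leaving space for generation."""
    budget = max_tokens - reserve_for_output
    if budget <= 0:
        raise ValueError("context window too small to hold any input")
    return token_ids[-budget:]  # oldest tokens are dropped first

trimmed = fit_to_window(list(range(1000)), max_tokens=128)
```

Dropping the oldest tokens is rarely the right policy on its own, which is precisely why retrieval-augmented generation exists: it selects what to keep instead of truncating blindly.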

Sampling, temperature, and token-level choices

At generation time, models produce a probability distribution over the next token. Decoding strategies—greedy search, nucleus sampling, temperature scaling—reshape that distribution. Temperature does not change what the model knows; it changes how aggressively decoding explores low-probability continuations. Pair this article with Temperature and with Hallucination, because fluent sampling can look like knowledge when grounding is absent.
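Both temperature scaling and nucleus (top-p) sampling are short enough to write out. This sketch uses invented logits; real decoders work over vocabulary-sized distributions, but the reshaping is identical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature rescales logits before normalization: <1 sharpens, >1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept

logits = [2.0, 1.0, 0.2, -1.0]          # invented next-token logits
sharp = softmax(logits, temperature=0.5)  # mass concentrates on the top token
flat = softmax(logits, temperature=2.0)   # mass spreads toward the tail
```

Greedy search is the degenerate case: always pick the argmax, equivalent to the temperature-to-zero limit.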

Practical checklist

  • Log token counts on real user inputs, not only on demo prompts.
  • Compare tokenizer behavior across languages you actually serve.
  • Treat “we fit the document” claims skeptically if the document is near the advertised limit—edge effects bite hardest there.
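The first checklist item can be as simple as summarizing count distributions over real traffic. In this sketch, `count_tokens` is a deliberate placeholder (whitespace splitting); in practice you would swap in the exact tokenizer your provider bills with.

```python
import statistics

def count_tokens(text):
    """Placeholder counter; replace with your provider's actual tokenizer."""
    return len(text.split())

def summarize(inputs):
    """Median and max token counts over a batch of real user inputs."""
    counts = [count_tokens(t) for t in inputs]
    return {"p50": statistics.median(counts), "max": max(counts)}

stats = summarize(["a b", "a b c d", "a"])
```

Watching the max, not just the median, is what surfaces the near-limit documents the last checklist item warns about.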

Special tokens and control codes

Most pipelines reserve indices for padding, unknown characters, beginning/end-of-sequence markers, and sometimes role tags (system, user, assistant). These tokens consume budget like any other but carry structural meaning—prompt templates that “look short” in a chat UI may hide dozens of control tokens. When comparing systems, inspect the rendered prompt after template expansion, not only the user-visible string. That discipline connects to agent traces where tool outputs are wrapped in additional scaffolding.
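Rendering the template and counting its control markers makes the hidden cost concrete. The marker strings below are invented, loosely modeled on common chat formats; real templates differ per model family, which is exactly why you should inspect the rendered output.

```python
# Invented control markers for the sketch
SPECIALS = ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>"]

def render(messages):
    """Expand (role, content) pairs into a flat prompt string with control markers."""
    parts = [f"<|{role}|>{content}<|end|>" for role, content in messages]
    parts.append("<|assistant|>")  # open the assistant turn for generation
    return "".join(parts)

prompt = render([("system", "Be brief."), ("user", "Hi")])
hidden = sum(prompt.count(s) for s in SPECIALS)  # markers the chat UI never shows
```

Here a two-message conversation whose visible text is eleven characters carries five control markers, each of which spends context budget.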

Migration and version pins

Tokenizer or vocabulary changes break reproducibility: the same string maps to different IDs, shifting logits and attention patterns. Pin tokenizer revisions in model cards and CI; treat tokenizer bumps as breaking changes unless you re-validate downstream tasks. For large-scale training dynamics, see scaling—data and tokenizer choices are inseparable from loss curves.
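One way to make a tokenizer pin enforceable is to hash the tokenizer artifact and fail CI when the digest drifts. This is a sketch of that check; the file name and the pinned digest are placeholders for whatever your project actually ships.

```python
import hashlib

def digest(path):
    """SHA-256 of a file, read in chunks so large vocab files don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# In CI (PINNED_DIGEST recorded in the model card or lockfile):
# assert digest("tokenizer.json") == PINNED_DIGEST, "tokenizer changed; re-validate"
```

Treating a digest mismatch as a hard failure forces the re-validation step the paragraph above calls for, instead of letting a vocabulary bump slip through as a routine dependency update.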