Latency budgets decide what “intelligent” feels like
Benchmarks rarely capture waiting: humans build trust from partial results, predictable pacing, and recoverable failures.
Time to first token matters
Streaming tokens to the UI beats staring at a spinner, even when final quality is identical, because users can steer early. That shifts engineering priority toward prefill optimization, batching, and model routing: small models for “easy” turns, larger ones for complex reasoning. Serving cost, in turn, scales directly with token counts and context length.
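As a minimal sketch of the two ideas above, here is a heuristic router and a TTFT probe. The function names, markers, and thresholds are illustrative assumptions, not any particular vendor's API:

```python
import time

def route_model(prompt: str) -> str:
    """Hypothetical router: short, simple turns go to a small model;
    long or reasoning-heavy turns go to a large one. Thresholds are illustrative."""
    hard_markers = ("explain", "prove", "refactor", "step by step")
    if len(prompt.split()) > 64 or any(m in prompt.lower() for m in hard_markers):
        return "large-model"
    return "small-model"

def time_to_first_token(stream):
    """Measure TTFT: seconds from request start to the first streamed token."""
    start = time.monotonic()
    first = next(iter(stream))
    return time.monotonic() - start, first
```

In practice the stream would come from a model client's streaming API; the point is that TTFT is measured at the first token, not at completion.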
Chains of tools amplify delay
Agents that call search, code execution, or enterprise APIs pay network and cold-start costs serially unless parallelized. Reliability strategies in tool-using agents must include latency budgets per hop, not only success rates.
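A hedged sketch of per-hop latency budgets using `asyncio`: independent tool calls run concurrently, and a hop that blows its budget times out and degrades instead of stalling the whole chain. The tool names and delays are stand-ins:

```python
import asyncio

async def call_tool(name: str, delay: float) -> str:
    # Stand-in for a network tool call (search, code execution, enterprise API).
    await asyncio.sleep(delay)
    return f"{name}:ok"

async def fan_out(budget_per_hop: float = 0.5) -> dict:
    """Run independent tool calls concurrently, each under its own latency budget."""
    tasks = {
        "search": call_tool("search", 0.1),
        "db":     call_tool("db", 0.2),
        "slow":   call_tool("slow", 2.0),  # will exceed the 0.5s budget
    }
    results: dict = {}

    async def bounded(name, coro):
        try:
            results[name] = await asyncio.wait_for(coro, timeout=budget_per_hop)
        except asyncio.TimeoutError:
            results[name] = f"{name}:timeout"  # degrade, don't stall the chain

    await asyncio.gather(*(bounded(n, c) for n, c in tasks.items()))
    return results
```

Run with `asyncio.run(fan_out())`; total wall time is bounded by the slowest budget, not the sum of all hops.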
Quantization trades quality margins for speed
On-device and edge inference often relies on reduced precision (see quantization), which can shift tail behavior: most outputs are unchanged while a small fraction degrade sharply. Fast-but-wrong is worse than slow-but-right for some tasks; choose the trade-off per workflow.
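A toy symmetric int8 round-trip makes the traded quality margin concrete. This is a pure-Python sketch, not any production quantization scheme:

```python
def quantize_int8(xs):
    """Symmetric int8 quantization: map floats onto [-127, 127].
    Returns (quantized ints, scale) so values can be dequantized."""
    scale = max(abs(x) for x in xs) / 127 or 1.0
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.031, 0.9]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
# The round-trip error is the quality margin spent for speed and memory;
# with round-to-nearest it is bounded by half the scale per value.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The worst-case error here is tiny per weight, but errors accumulate across layers, which is where the tail-behavior shifts come from.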
Energy and fairness
Latency intersects with who can afford to run systems at scale. Read energy as part of UX for why efficiency is an access issue, not only a climate one.
Client-side buffering and perceived speed
Chunked HTTP responses, progressive rendering of markdown, and skeleton UIs reduce perceived latency even when server time is fixed; this matters on mobile networks and on energy-constrained devices that throttle CPUs.
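A minimal generator sketch of server-side chunking (framework wiring omitted; the handler shape described in the comment is an assumption):

```python
def stream_chunks(text: str, chunk_size: int = 16):
    """Yield fixed-size chunks so the client can render progressively.
    Returned as a WSGI/ASGI response body, each yield reaches the browser
    immediately via Transfer-Encoding: chunked instead of waiting for the
    full body."""
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]
```

The client reassembles the chunks losslessly; only the pacing changes, which is exactly the perceived-latency win.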
SLIs for AI features
Define service level indicators: p95 time-to-first-token, error rate on tool calls, and rollback triggers when latency regresses after a release. Traditional SRE practice applies; the novelty is stochastic generation length—budget for worst-case chains in agent flows.
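A small sketch of the p95 SLI and a rollback trigger, using the nearest-rank percentile; the 20% regression tolerance is an illustrative assumption, not a standard:

```python
import math

def p95(samples):
    """Nearest-rank p95: the smallest value that is >= 95% of the samples."""
    ordered = sorted(samples)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def ttft_regressed(baseline_ms, current_ms, tolerance=1.2):
    """Trigger rollback when p95 time-to-first-token exceeds the
    pre-release baseline by more than 20%."""
    return p95(current_ms) > p95(baseline_ms) * tolerance
```

Because generation length is stochastic, collect enough samples per release for the p95 to stabilize before letting it gate a rollback.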