Quantization trades margin for miles per watt
Lower-precision weights shrink memory footprints and speed up matmuls, while quietly shifting the loss landscape in ways your benchmarks may never notice.
Why averages hide tail risk
Quantization maps floating-point tensors onto fixed-point grids. Most tokens in most prompts are unaffected, but rare tokenizations, long reasoning chains, and multilingual spans can spike error rates. A leaderboard score can stay flat while safety regressions appear, which is exactly where red-teaming should focus after compression.
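A minimal sketch makes the tail-risk point concrete. The scheme below is illustrative symmetric int8 quantization with a single per-tensor scale (not any particular library's implementation): one rare outlier stretches the grid, roughly doubling the rounding error for every other value in the tensor.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric uniform quantization: one scale for the whole tensor."""
    scale = np.max(np.abs(x)) / 127.0
    codes = np.clip(np.round(x / scale), -127, 127)
    return codes, scale

def roundtrip_err(x):
    codes, scale = quantize_int8(x)
    return np.abs(codes * scale - x)

rng = np.random.default_rng(0)
body = rng.normal(size=10_000)          # the "typical" values
tail = np.concatenate([body, [8.0]])    # one rare outlier

err_alone = roundtrip_err(body).mean()
# Same values, but the outlier widened the grid for everyone:
err_with_tail = roundtrip_err(tail)[:-1].mean()
print(f"mean abs error alone: {err_alone:.4f}, with outlier: {err_with_tail:.4f}")
```

The mean error is small in both cases, which is why aggregate benchmarks miss it; the damage concentrates wherever the distribution has tails.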
Hardware and UX
On-device inference unlocks offline and privacy-sensitive workflows (see open weights vs API), but only if latency and quality remain acceptable; latency budgets tie directly to user trust.
Interaction with context length
Long context increases activation memory, and weight-only quantization leaves activations, including the KV cache, at their original precision. Profile end-to-end, not kernels alone.
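A back-of-envelope memory model shows why. The configuration below is hypothetical (roughly 7B-class numbers, invented for illustration): int4 weights fit in a few GiB, but an fp16 KV cache at long context dwarfs them, and weight-only quantization does nothing about it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt):
    # Keys plus values, stored per layer, per head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

def weight_bytes(n_params, bits):
    return n_params * bits // 8

# Hypothetical model config, fp16 (2-byte) activations.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elt=2)

w4 = weight_bytes(7_000_000_000, 4)            # int4 weights
kv_128k = kv_cache_bytes(seq_len=131_072, **cfg)  # fp16 cache at 128k context
print(f"int4 weights: {w4 / 2**30:.1f} GiB, fp16 KV cache @128k: {kv_128k / 2**30:.1f} GiB")
```

At short context the weights dominate and quantizing them is the whole story; at long context the unquantized cache dominates, which is the end-to-end effect kernel microbenchmarks miss.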
Energy angle
Fewer bits per multiply means fewer joules per token, which links to energy as access for teams without hyperscaler budgets.
Quantization-aware training (QAT)
Models trained with quantization in the loop (QAT) tolerate low precision better than post-training quantization of FP32 checkpoints, especially for small architectures destined for edge deployment alongside open weights.
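The core QAT trick can be sketched in a few lines: "fake quantization" snaps weights to the grid in the forward pass, while the backward pass uses a straight-through estimator that treats rounding as the identity. This is an illustrative toy on linear regression, not any framework's API; the model, sizes, and learning rate are made up.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Snap weights to a symmetric uniform grid (forward pass only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

def qat_step(w, x, y, lr=0.1):
    w_q = fake_quant(w)
    pred = x @ w_q                       # forward pass sees quantized weights
    grad = x.T @ (pred - y) / len(x)     # STE: backprop as if rounding were identity
    return w - lr * grad                 # update the full-precision "shadow" weights

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=(8, 1))
y = x @ w_true

w = 0.1 * rng.normal(size=(8, 1))        # nonzero init so the scale is defined
for _ in range(300):
    w = qat_step(w, x, y)

loss = float(np.mean((x @ fake_quant(w) - y) ** 2))
print(f"loss with 4-bit weights after QAT: {loss:.4f}")
```

Because training always saw the grid, the final quantized weights land close to the target; quantizing a model trained purely in FP32 gives it no such chance to adapt.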
Mixed precision and layer sensitivity
Not all layers quantize equally well; attention projections and layer norms are often more sensitive. Mixed-precision schemes keep critical ops in higher precision, and per-layer profiling avoids one-size-fits-all mistakes.
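A per-layer sweep is the simplest form of that profiling: quantize one weight matrix at a time, keep everything else at full precision, and measure how far the output moves. The toy MLP, layer names, and sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant(w, bits=4):
    """Symmetric uniform quantization of a single weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    s = np.max(np.abs(w)) / qmax
    return np.round(w / s) * s

# Hypothetical three-layer MLP with made-up layer names.
layers = {
    "proj_in": rng.normal(size=(16, 64)),
    "proj_mid": rng.normal(size=(64, 64)),
    "proj_out": rng.normal(size=(64, 4)),
}

def forward(ws, x):
    h = np.maximum(x @ ws["proj_in"], 0)
    h = np.maximum(h @ ws["proj_mid"], 0)
    return h @ ws["proj_out"]

x = rng.normal(size=(128, 16))
ref = forward(layers, x)

errs = {}
for name in layers:
    trial = dict(layers)
    trial[name] = quant(layers[name])    # quantize just this one layer
    errs[name] = float(np.mean((forward(trial, x) - ref) ** 2))
    print(f"{name}: mse={errs[name]:.3f}")
```

The layers with the largest output deviation are the candidates to keep in higher precision; everything else can drop to low bit-widths cheaply.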