Quantization trades margin for miles per watt
Lower-precision weights shrink memory footprints and speed up matmuls, while quietly shifting the loss landscape in ways your benchmarks may never notice.
Why averages hide tail risk
Quantization maps floating-point tensors onto fixed-point grids. Most tokens in most prompts are unaffected, but rare tokenizations, long reasoning chains, and multilingual spans can spike error rates. A leaderboard score can stay flat while safety regressions appear, which is exactly where red-teaming should focus after compression.
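A minimal sketch makes the tail-risk point concrete. The scheme below is illustrative symmetric int8 quantization with a single per-tensor scale (not any particular library's implementation): one rare outlier stretches the grid, roughly doubling the rounding error for every other value in the tensor.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric uniform quantization: one scale for the whole tensor."""
    scale = np.max(np.abs(x)) / 127.0
    codes = np.clip(np.round(x / scale), -127, 127)
    return codes, scale

def roundtrip_err(x):
    codes, scale = quantize_int8(x)
    return np.abs(codes * scale - x)

rng = np.random.default_rng(0)
body = rng.normal(size=10_000)          # the "typical" values
tail = np.concatenate([body, [8.0]])    # one rare outlier

err_alone = roundtrip_err(body).mean()
# Same values, but the outlier widened the grid for everyone:
err_with_tail = roundtrip_err(tail)[:-1].mean()
print(f"mean abs error alone: {err_alone:.4f}, with outlier: {err_with_tail:.4f}")
```

The mean error is small in both cases, which is why aggregate benchmarks miss it; the damage concentrates wherever the distribution has tails.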
Hardware and UX
On-device inference unlocks offline and privacy-sensitive workflows (see open weights vs API), but only if latency and quality remain acceptable; latency budgets tie directly to user trust.
Interaction with context length
Long context increases activation memory, and weight-only quantization leaves activations, including the KV cache, at their original precision. Profile end-to-end, not kernels alone.
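A back-of-envelope memory model shows why. The configuration below is hypothetical (roughly 7B-class numbers, invented for illustration): int4 weights fit in a few GiB, but an fp16 KV cache at long context dwarfs them, and weight-only quantization does nothing about it.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt):
    # Keys plus values, stored per layer, per head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

def weight_bytes(n_params, bits):
    return n_params * bits // 8

# Hypothetical model config, fp16 (2-byte) activations.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elt=2)

w4 = weight_bytes(7_000_000_000, 4)            # int4 weights
kv_128k = kv_cache_bytes(seq_len=131_072, **cfg)  # fp16 cache at 128k context
print(f"int4 weights: {w4 / 2**30:.1f} GiB, fp16 KV cache @128k: {kv_128k / 2**30:.1f} GiB")
```

At short context the weights dominate and quantizing them is the whole story; at long context the unquantized cache dominates, which is the end-to-end effect kernel microbenchmarks miss.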
Energy angle
Fewer bits per multiply means fewer joules per token, which links to energy as access for teams without hyperscaler budgets.
Quantization-aware training (QAT)
Models trained with quantization in the loop (QAT) tolerate low precision better than post-training quantization of FP32 checkpoints, especially for small architectures destined for edge deployment alongside open weights.
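The core QAT trick can be sketched in a few lines: "fake quantization" snaps weights to the grid in the forward pass, while the backward pass uses a straight-through estimator that treats rounding as the identity. This is an illustrative toy on linear regression, not any framework's API; the model, sizes, and learning rate are made up.

```python
import numpy as np

def fake_quant(w, bits=4):
    """Snap weights to a symmetric uniform grid (forward pass only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

def qat_step(w, x, y, lr=0.1):
    w_q = fake_quant(w)
    pred = x @ w_q                       # forward pass sees quantized weights
    grad = x.T @ (pred - y) / len(x)     # STE: backprop as if rounding were identity
    return w - lr * grad                 # update the full-precision "shadow" weights

rng = np.random.default_rng(1)
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=(8, 1))
y = x @ w_true

w = 0.1 * rng.normal(size=(8, 1))        # nonzero init so the scale is defined
for _ in range(300):
    w = qat_step(w, x, y)

loss = float(np.mean((x @ fake_quant(w) - y) ** 2))
print(f"loss with 4-bit weights after QAT: {loss:.4f}")
```

Because training always saw the grid, the final quantized weights land close to the target; quantizing a model trained purely in FP32 gives it no such chance to adapt.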
Mixed precision and layer sensitivity
Not all layers quantize equally well; attention projections and layer norms are often more sensitive. Mixed-precision schemes keep critical ops in higher precision, and per-layer profiling avoids one-size-fits-all mistakes.
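A per-layer sweep is the simplest form of that profiling: quantize one weight matrix at a time, keep everything else at full precision, and measure how far the output moves. The toy MLP, layer names, and sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quant(w, bits=4):
    """Symmetric uniform quantization of a single weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    s = np.max(np.abs(w)) / qmax
    return np.round(w / s) * s

# Hypothetical three-layer MLP with made-up layer names.
layers = {
    "proj_in": rng.normal(size=(16, 64)),
    "proj_mid": rng.normal(size=(64, 64)),
    "proj_out": rng.normal(size=(64, 4)),
}

def forward(ws, x):
    h = np.maximum(x @ ws["proj_in"], 0)
    h = np.maximum(h @ ws["proj_mid"], 0)
    return h @ ws["proj_out"]

x = rng.normal(size=(128, 16))
ref = forward(layers, x)

errs = {}
for name in layers:
    trial = dict(layers)
    trial[name] = quant(layers[name])    # quantize just this one layer
    errs[name] = float(np.mean((forward(trial, x) - ref) ** 2))
    print(f"{name}: mse={errs[name]:.3f}")
```

The layers with the largest output deviation are the candidates to keep in higher precision; everything else can drop to low bit-widths cheaply.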