Compute curves: what “bigger” actually buys you
Empirical scaling relationships hold until data, hardware, or task structure says otherwise—then different levers win.
Predictable regimes
For many transformer families, loss improves predictably with compute when data is abundant. That observation guided multi-year roadmaps—but plateaus appear when unique human-generated text runs thin or when task rewards are non-differentiable (tool use, long-horizon agency). Then retrieval, synthetic data with audits, or process supervision—not raw parameters—carry marginal gains.
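The "predictable" part can be sketched numerically. A minimal sketch of fitting such a curve, assuming a pure power law L(C) = a·C^(-b) with synthetic measurements (all numbers below are hypothetical; real curves add an irreducible-loss floor and bend when data runs thin):

```python
import numpy as np

# Hypothetical loss measurements at increasing compute budgets (FLOPs).
# Assumes a pure power law L(C) = a * C^(-b); real curves have an
# irreducible-loss term and break down near data exhaustion.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
a, b = 50.0, 0.05
loss = a * compute ** (-b)

# In log-log space a power law is a straight line: log L = log a - b log C.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
fitted_b = -slope
fitted_a = np.exp(intercept)

# Extrapolate one order of magnitude; trustworthy only while the regime
# assumptions (abundant unique data, differentiable objective) still hold.
predicted = fitted_a * 1e23 ** (-fitted_b)
print(round(fitted_b, 3), round(predicted, 3))
```

The log-log fit is what makes multi-year extrapolation feel safe; the plateaus above are exactly where the straight line stops being straight.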
Implications for evaluation
Larger models can reduce some error classes while leaving hallucination structurally possible; see the benchmarks note.
Energy and access
Orders-of-magnitude increases in compute spend raise climate and equity questions (see the energy note) and push interest toward distillation and efficient inference (quantization).
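A toy illustration of why quantization cuts memory and energy: symmetric int8 rounding of a weight tensor (a sketch, not a production scheme; the tensor and per-tensor scale choice are invented for the example):

```python
import numpy as np

# Toy symmetric int8 quantization of a hypothetical weight tensor.
weights = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale

# 4x smaller storage (int8 vs float32) for a bounded rounding error
# of at most half a quantization step.
max_err = np.abs(weights - dequant).max()
print(q.dtype, dequant.dtype, float(max_err))
```

Production schemes use per-channel or per-group scales and calibration data, but the storage-for-error trade is the same.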
Geometry angle
Scale reshapes representation manifolds—connect to understanding as geometry and attention limits.
Chinchilla-style compute-data trade-offs
Empirical studies (the Chinchilla result most prominently) suggest many large models were undertrained for their compute budget: at a fixed FLOP budget, training a smaller model on more tokens can beat simply growing parameters. Operationalizing this requires data pipeline investment, not only GPU purchases.
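The trade-off can be made concrete with the usual rules of thumb: training FLOPs ≈ 6·N·D and compute-optimal tokens D ≈ 20·N. Both are rough approximations, not exact constants; a sketch under those assumptions:

```python
# Rule-of-thumb Chinchilla accounting: training FLOPs C ~ 6 * N * D,
# compute-optimal token count D ~ 20 * N. Both are hedged approximations.
def chinchilla_split(flops_budget: float, tokens_per_param: float = 20.0):
    # Substitute D = r * N into C = 6 * N * D  =>  N = sqrt(C / (6 * r)).
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(5.76e23)  # roughly Chinchilla's training budget
# Lands close to the reported ~70B params / ~1.4T tokens.
print(f"{n / 1e9:.1f}B params, {d / 1e12:.2f}T tokens")
```

The point of the exercise: for a fixed budget, the optimizer's free variable is the params-vs-tokens split, and "more tokens" wins more often than the parameter-count headlines suggest.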
Infrastructure bottlenecks
Networking, storage I/O, and checkpointing can cap step time before matmuls do, so profiling clusters matters; this also ties back to per-experiment energy accounting.
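A sketch of the profiling idea: attribute wall-clock time per phase of a training step and check whether compute actually dominates. The phase names and sleep durations are stand-ins (real clusters use torch.profiler, Nsight, or vendor tools):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Minimal per-phase wall-clock profiler; illustrative only.
timings = defaultdict(float)

@contextmanager
def phase(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

for _ in range(3):                  # pretend training steps
    with phase("data_io"):
        time.sleep(0.010)           # stand-in for storage/network reads
    with phase("compute"):
        time.sleep(0.005)           # stand-in for matmuls
    with phase("checkpoint"):
        time.sleep(0.002)           # stand-in for checkpoint writes

# If data_io or checkpoint dominates, the GPUs are not the bottleneck.
total = sum(timings.values())
for name, t in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {100 * t / total:.0f}% of step time")
```

In this contrived run I/O dominates, which is exactly the situation where buying more GPUs would not shorten step time.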