Scaling

Compute curves: what “bigger” actually buys you

Empirical scaling relationships hold until data, hardware, or task structure says otherwise—then different levers win.

Predictable regimes

For many transformer families, loss improves predictably with compute when data is abundant. That observation has guided multi-year roadmaps, but plateaus appear when unique human-generated text runs thin or when task rewards are non-differentiable (tool use, long-horizon agency). At that point retrieval, synthetic data with audits, or process supervision carries the marginal gains, not raw parameters.
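
A minimal sketch of how such curves are used in practice, assuming a saturating power law loss(C) = L_inf + a * C**(-b); the data points are synthetic and the constants are illustrative, not measurements from any model family:

```python
import numpy as np
from scipy.optimize import curve_fit

# Saturating power law often used to summarize compute scaling:
#   loss(C) = L_inf + a * C**(-b)
def scaling_law(compute, l_inf, a, b):
    return l_inf + a * np.power(compute, -b)

# Synthetic "observations": generated from assumed parameters purely to
# illustrate the fit-and-extrapolate workflow, not real measurements.
true = dict(l_inf=1.7, a=8.0, b=0.12)
compute = np.logspace(2, 6, 9)              # arbitrary compute units
rng = np.random.default_rng(0)
loss = scaling_law(compute, **true) + rng.normal(0, 0.01, compute.size)

# Fit in the data-abundant regime; the extrapolation says nothing about
# what happens once unique text runs thin or rewards stop being differentiable.
params, _ = curve_fit(scaling_law, compute, loss, p0=(1.0, 5.0, 0.1))
l_inf, a, b = params
print(f"fitted floor ~ {l_inf:.2f}, exponent ~ {b:.3f}")
print(f"extrapolated loss at C=1e7: {scaling_law(1e7, *params):.2f}")
```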

Implications for evaluation

Larger models can reduce some error classes while leaving hallucination structurally possible (see benchmarks).

Energy and access

Orders-of-magnitude compute spend raises climate and equity questions (see the energy note) and pushes interest in distillation and efficient inference (see quantization).
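
One concrete reason quantization earns a mention here: storing weights in int8 rather than float32 cuts weight memory roughly 4x, which directly lowers inference cost. A minimal sketch of symmetric per-tensor weight quantization; the tensor shape is arbitrary, and real deployments typically quantize per-channel and calibrate activations as well:

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: one byte per weight
    plus a single float scale, instead of four bytes per float32 weight."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Illustrative weight tensor with an arbitrary shape.
w = np.random.default_rng(0).normal(0, 0.02, size=(1024, 4096)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, "
      f"max abs error {err:.2e}")
```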

Geometry angle

Scale reshapes representation manifolds; this connects to understanding as geometry and to attention limits.

Chinchilla-style compute-data trade-offs

Empirical studies suggest many models were over-parameterized and under-trained for their compute budgets: at a given FLOP budget, training on more tokens can beat simply growing parameters. Operationalizing this requires data pipeline investment, not only GPU purchases.
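
A back-of-the-envelope version of that trade-off, using two widely quoted approximations (training FLOPs C ≈ 6ND for N parameters and D tokens, and roughly 20 tokens per parameter at the compute-optimal point); treat the constants as heuristics rather than fitted values for any particular model family:

```python
# Back-of-the-envelope compute-optimal split under two common approximations:
#   training FLOPs  C ≈ 6 * N * D   (N = parameters, D = training tokens)
#   Chinchilla-style rule of thumb: D ≈ 20 * N at the compute-optimal point.
# The constants are rough heuristics, not fitted values for a specific model family.

def compute_optimal_split(flop_budget: float, tokens_per_param: float = 20.0):
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    tokens = tokens_per_param * params
    return params, tokens

for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```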

Infrastructure bottlenecks

Networking, storage I/O, and checkpointing can cap step time before matmuls do, so profiling the cluster matters; this also ties into energy accounting per experiment.
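
A minimal, framework-agnostic sketch of that kind of per-phase accounting; the phase names are assumptions, and the sleeps stand in for real work in whatever training loop the cluster actually runs:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Coarse step-time breakdown: if "data", "comm", or "checkpoint" dominates
# "compute", adding accelerators will not shorten the step.
timings = defaultdict(float)

@contextmanager
def phase(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] += time.perf_counter() - start

def train_step(step: int) -> None:
    with phase("data"):        # host-side input pipeline, storage I/O
        time.sleep(0.003)
    with phase("compute"):     # forward/backward/optimizer on accelerators
        time.sleep(0.010)
    with phase("comm"):        # gradient all-reduce over the network
        time.sleep(0.004)
    if step % 50 == 0:
        with phase("checkpoint"):
            time.sleep(0.020)

for step in range(100):
    train_step(step)

total = sum(timings.values())
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {seconds:6.2f}s  {100 * seconds / total:5.1f}%")
```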