Benchmarks measure proxies
Leaderboards compress multidimensional quality into single numbers: useful for progress tracking, dangerous for deployment decisions.
What gets measured gets gamed
Narrow tasks with public tests invite overfitting—directly or via contamination in web-scale training data. Even honest teams feel pressure to optimize obvious scores. Complement public benchmarks with private holdouts, user-session replays, and domain-specific suites—especially after system changes like quantization.
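One way to make "re-evaluate after system changes" concrete is a small regression check over a private suite. A minimal sketch, assuming per-metric scores are already computed; the function name, metric keys, and tolerance are illustrative, not a real tool's API:

```python
# Sketch: after a system change (e.g. quantization), compare private-suite
# scores against the pre-change baseline and flag regressions beyond a
# tolerance. All names and the 0.01 tolerance are illustrative assumptions.

def regressions(before: dict, after: dict, tolerance: float = 0.01) -> dict:
    """Return {metric: (old, new)} for metrics that dropped by more than `tolerance`."""
    return {
        metric: (before[metric], after[metric])
        for metric in before
        if metric in after and before[metric] - after[metric] > tolerance
    }
```

Wiring this into CI turns the private holdout into a gate rather than a dashboard: a quantized build that regresses any suite metric fails the check before deployment.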
Human baselines are not universal
“Human-level” claims hide which humans, under what time limits, with what tools. Similar issues appear in hallucination analysis: fluent errors fool experts too. Pair quantitative metrics with qualitative review for high-stakes domains.
Workflow beats leaderboard
Real tasks mix ambiguous instructions, partial observability, and multi-step tool use. Measure end-to-end outcomes, recovery from failure, and operator time—not only exact-match accuracy on clean prompts.
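The shift from per-prompt accuracy to end-to-end measurement can be sketched as an aggregation over whole task episodes. The `EpisodeResult` fields and metric names below are illustrative assumptions, not a standard schema:

```python
# Sketch: summarize workflow-level outcomes (success, recovery from failure,
# operator time) across episodes, instead of exact-match accuracy per prompt.

from dataclasses import dataclass

@dataclass
class EpisodeResult:
    succeeded: bool          # did the full multi-step task finish correctly?
    failures_recovered: int  # mid-episode errors the system recovered from
    failures_total: int      # mid-episode errors encountered
    operator_seconds: float  # human time spent supervising or correcting

def workflow_metrics(episodes: list) -> dict:
    """Aggregate outcome, recovery, and operator-cost metrics over episodes."""
    n = len(episodes)
    total_failures = sum(e.failures_total for e in episodes)
    return {
        "success_rate": sum(e.succeeded for e in episodes) / n,
        "recovery_rate": (sum(e.failures_recovered for e in episodes) / total_failures
                          if total_failures else 1.0),
        "mean_operator_seconds": sum(e.operator_seconds for e in episodes) / n,
    }
```

A system can score well on clean-prompt accuracy while its recovery rate and operator cost are poor; tracking all three exposes that gap.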
Safety evaluations are benchmarks too
Red-team suites (see the red-teaming essay) can become stale as policies adapt. Treat them like regression tests, not trophies.
Contamination detection
N-gram overlap and embedding similarity between benchmark items and training sets help flag leakage—imperfect but better than ignoring the issue. Open evaluation consortia increasingly require decontamination reports for headline numbers.
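The n-gram overlap approach mentioned above can be sketched in a few lines. This is a minimal illustration, not a production decontamination pipeline; the function names, the 8-gram window, and the 0.3 threshold are all assumptions:

```python
# Sketch: flag benchmark items whose word n-grams substantially overlap
# pooled training text. Window size and threshold are illustrative choices.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in `text` (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(item: str, corpus_ngrams: set, n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear in the training corpus."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

def contamination_flags(items, corpus_texts, n: int = 8, threshold: float = 0.3):
    """Flag each item whose overlap with the pooled corpus exceeds `threshold`."""
    corpus_ngrams = set()
    for doc in corpus_texts:
        corpus_ngrams |= ngrams(doc, n)
    return [overlap_fraction(item, corpus_ngrams, n) >= threshold for item in items]
```

Exact n-gram matching misses paraphrased leakage, which is why embedding similarity is typically used alongside it.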
Human evaluation protocols
Blind A/B tests with rubrics, inter-rater reliability checks, and stratification by expertise level beat single-number Elo on chat tasks. They are expensive but necessary for products claiming parity with professionals, and their design overlaps with RLHF rating design.
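Inter-rater reliability is commonly quantified with chance-corrected agreement; Cohen's kappa for two raters is the simplest case. A minimal sketch, assuming both raters label the same items with preferences such as "A", "B", or "tie":

```python
# Sketch: Cohen's kappa, i.e. observed agreement between two raters
# corrected for the agreement expected by chance from their label marginals.

from collections import Counter

def cohens_kappa(rater1: list, rater2: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement if raters chose independently at their marginal rates.
    c1, c2 = Counter(rater1), Counter(rater2)
    labels = set(rater1) | set(rater2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near zero means the raters agree no more than chance, a signal that the rubric (or the rater pool's expertise) needs revisiting before any A/B verdict is trusted.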