Why benchmarks mislead

Public benchmarks are a useful first filter but a weak final answer. Scores can be inflated by training on test data, and they measure broad skills rather than your specific inputs. Use them to narrow, then test on your own cases.

Public benchmarks are worth a glance, but a high score is a weak promise. They are good for a first cut and bad as a final answer.

One problem is that benchmark scores can be inflated rather than earned. A 2025 review of the field argues that “benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities,” partly because models can pick up the test questions during training and partly because strong scores often fail to carry over to real tasks.^[1] When a model has effectively seen the answer key, the number tells you about memory, not skill.

The other problem is fit. Standardized benchmarks “test broad capabilities, not the specific inputs your system might handle.”^[2] A model can top a reasoning chart and still stumble on the particular phrasing, format, and edge cases of your work. So treat benchmarks as a way to build a shortlist, then settle the choice the way the previous page describes: run the candidates on your own cases and see which one actually does your job.
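As a rough illustration of "run the candidates on your own cases," here is a minimal sketch of a comparison harness. Everything in it is an assumption made for the example: the names `score_candidates`, `model_a`, and `model_b` are hypothetical, the candidates are stand-in functions rather than real model calls, and exact string match is only one possible way to grade an answer.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a candidate is any function that maps an input string to an output string.
# In practice this would wrap a call to whichever model you are evaluating.
Candidate = Callable[[str], str]


def score_candidates(
    candidates: Dict[str, Candidate],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return, per candidate, the fraction of cases answered exactly right."""
    if not cases:
        raise ValueError("need at least one test case")
    scores: Dict[str, float] = {}
    for name, run in candidates.items():
        # Exact match after trimming whitespace; swap in your own grading rule.
        passed = sum(1 for prompt, expected in cases if run(prompt).strip() == expected)
        scores[name] = passed / len(cases)
    return scores


# Hypothetical stand-in candidates so the sketch runs without any model API.
def model_a(prompt: str) -> str:
    return prompt.upper()


def model_b(prompt: str) -> str:
    return prompt.title()


# Your own cases: (input, expected output) pairs drawn from real work.
cases = [
    ("invoice total", "INVOICE TOTAL"),
    ("due date", "DUE DATE"),
    ("po number", "Po Number"),
]

scores = score_candidates({"model_a": model_a, "model_b": model_b}, cases)
best = max(scores, key=scores.get)
print(scores, best)
```

With these toy cases, `model_a` passes two of three and `model_b` passes one, so `model_a` comes out ahead. The point is the shape of the loop, not the grading rule: the cases come from your work, and the ranking comes from them rather than from a public leaderboard.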

References

[1] James Fodor, “Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models,” arXiv, 2025.
[2] Evidently AI, “LLM evaluation: a beginner's guide.”
