Test on your data — Physea Wiki

The decisive test is not a leaderboard. It is a small set of your own real cases, scored the way you define good. Building that evaluation set is the single most important step in choosing a model.

Once you have a shortlist, the way to decide between models is to run them against your own data, not to compare public scores. Anthropic’s guidance is to create tests specific to your use case; it notes that “having a good evaluation set is the most important step in the process,” and recommends testing with your actual prompts and data and comparing accuracy, quality, and edge-case handling.^[1]

This evaluation set, often called an “eval,” is just a collection of real cases paired with what a good answer looks like. The point is that it reflects your work. Evidently AI draws the line clearly: a product eval is “more like a job performance review” that checks whether the system “excels in the specific task it was ‘hired’ for.”^[3]
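As a rough illustration of what such a set can look like in practice, the sketch below pairs each real prompt with the text a good answer must contain, then scores two candidate models on the same cases. Everything in it is an assumption made for illustration: the `EvalCase` type, the substring-based `passes` rule, and the stand-in functions `model_a` and `model_b`, which take the place of real model calls. Your own definition of "good" will usually need a richer scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str    # a real input taken from your own work
    expected: str  # text a good answer must contain (hypothetical criterion)


def passes(output: str, case: EvalCase) -> bool:
    # Assumed scoring rule: case-insensitive substring match.
    return case.expected.lower() in output.lower()


def score_model(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    # Fraction of cases whose output passes the scoring rule.
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(passes(model(c.prompt), c) for c in cases)
    return passed / len(cases)


def compare(
    models: Dict[str, Callable[[str], str]], cases: List[EvalCase]
) -> Dict[str, float]:
    # Run every candidate on the same cases so the scores are comparable.
    return {name: score_model(fn, cases) for name, fn in models.items()}


# Stand-ins for real model calls; replace with your own client code.
def model_a(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I am not sure."


def model_b(prompt: str) -> str:
    return "I am not sure."


CASES = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Which plan includes SSO?", "Enterprise"),
]

RESULTS = compare({"model_a": model_a, "model_b": model_b}, CASES)
print(RESULTS)
```

Keeping the cases as plain data, separate from the scoring rule, means the same set can be rerun unchanged whenever a new model joins the shortlist.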

You do not need a huge set to start. The highest-value move is to look at your own outputs and let the failures guide you. In Hamel Husain’s words, this bottom-up error analysis “is the single highest-ROI activity in AI development.”^[2] A few dozen real cases, scored honestly, will tell you more about which model to pick than any leaderboard can.
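One simple way to let the failures guide you is to read each output, write a short label for what went wrong, and count the labels. The sketch below assumes a hypothetical review format, a list of `(case_id, failure_label)` pairs in which an empty label means the output was acceptable. The labels and case ids are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def tally_failures(reviews: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    # Count non-empty failure labels, most frequent first.
    counts = Counter(label for _, label in reviews if label)
    return counts.most_common()


# Hypothetical notes from reading five real outputs by hand.
REVIEWS = [
    ("case-01", ""),
    ("case-02", "missed policy detail"),
    ("case-03", "wrong tone"),
    ("case-04", "missed policy detail"),
    ("case-05", ""),
]

TALLY = tally_failures(REVIEWS)
for label, count in TALLY:
    print(f"{count:>3}  {label}")
```

The most frequent label is a reasonable first candidate for a new eval case or a prompt change.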

References

1. Choosing the right model — Anthropic
2. A pragmatic guide to LLM evals for devs — The Pragmatic Engineer (Hamel Husain)
3. LLM evaluation: a beginner's guide — Evidently AI
