The Benchmark of One
Every time a frontier model drops, I run the same private experiment. Three tabs, the same prompt that's been giving me trouble that week, and a quiet half-hour reading the outputs side by side. The leaderboard already told me which one to use. The half-hour usually disagrees.
I spent two years working on Chatbot Arena and RouteLLM.[1][2] I'm familiar with what public benchmarks measure and what they don't. The thing I keep relearning is that "best on average" is the wrong question for almost any specific piece of work I'm actually trying to do.
The Private Ritual
Every few weeks a new model arrives, and the ritual repeats: three tabs, one messy prompt, outputs read in parallel. The prompt is usually one where the public benchmarks would barely register a difference: a research dump that needs to become a decision memo, a half-baked launch hook, a code review of a system the model has never seen.
One reads too smooth, another misses the point entirely, and the third has a sentence I would never have written. Somehow that one moves the whole thing forward. So I switch, at least for now.
The problem is that the evidence disappears. I make the judgment, maybe tell a friend, and move on. A month later the model has changed, my work has changed, and I am back to intuition without memory. The half-hour was useful. It just didn't leave anything behind.
The Leaderboard Was a Shortcut
Benchmarks were useful because they compressed judgment. I didn't have to test every model myself. Someone else chose the task set, ran the systems, published the scores, and gave me a rough map.
That worked while the benchmark was hard to game, slow to saturate, and close enough to the work people cared about. Those conditions are weaker now. Labs train against public evals. Scores become launch assets. Some tests drift so far from daily work that the number keeps climbing while the experience gets worse. A model can win the leaderboard and still be the wrong tool for the essay I am trying to write, the code review I am trying to do, or the launch I am trying to ship.
This isn't unique to AI. Most public metrics share the same failure mode: they measure what was easy for someone else to count, then drift further from the thing you actually care about every quarter they stay in production. The score keeps climbing long after it has stopped tracking the thing it was a proxy for.
The answer is to write the measurement yourself.
A Rubric Is Compressed Taste
Taste is usually described as a feeling, but the feeling is built from judgments that can be named. A rubric is what you get when you write those judgments down before the result has a chance to persuade you.
This is why benchmarking sits downstream of taste. In "Taste Was Always the Job," I argued that as AI makes execution cheaper, judgment becomes more important.[3] A private benchmark is one way of preserving that judgment after the moment has passed. It turns a preference into an instrument you can rerun.
For a founder working with AI every day, the questions aren't academic:
- Did this sharpen the decision, or did it merely summarize the situation?
- Did it preserve the voice, or sand it into competent neutrality?
- Did it catch the real engineering risk, or invent a risk-shaped sentence?
- Did it understand the product, or optimize for a generic conversion pattern?
- Did it make the next move clearer?
Most weeks the question that catches a model out is the third one. The risk-shaped sentence. A model will produce a paragraph about "potential concerns around scalability" or "data consistency tradeoffs" that pattern-matches on what code review usually looks like, without ever touching the class actually causing the problem. The rubric is what lets me tell the difference between a review and the shape of one.
A good rubric records enough of the half-hour that next month I can argue with my past self.
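A minimal sketch of how such a rubric could be written down as data. The short key names, the 0-2 scale, and the `Judgment` class are illustrative assumptions, not a format the essay prescribes; only the five questions come from the list above.

```python
from dataclasses import dataclass, field
from datetime import date

# The five questions from the list above, under hypothetical short keys.
RUBRIC = {
    "sharpens_decision": "Did this sharpen the decision, or merely summarize the situation?",
    "preserves_voice": "Did it preserve the voice, or sand it into competent neutrality?",
    "real_risk": "Did it catch the real engineering risk, or invent a risk-shaped sentence?",
    "understands_product": "Did it understand the product, or optimize for a generic conversion pattern?",
    "clearer_next_move": "Did it make the next move clearer?",
}


@dataclass
class Judgment:
    model: str
    task: str
    scores: dict  # question key -> assumed scale: 0 = dodged, 1 = partial, 2 = handled
    notes: str = ""
    judged_on: date = field(default_factory=date.today)

    def __post_init__(self) -> None:
        # Force every question to be answered, so a skipped one is visible.
        if set(self.scores) != set(RUBRIC):
            raise ValueError("score every rubric question, and only those")
        if any(s not in (0, 1, 2) for s in self.scores.values()):
            raise ValueError("scores must be 0, 1, or 2")

    def dodged(self) -> list:
        """Return the question keys this output scored 0 on, in rubric order."""
        return [q for q in RUBRIC if self.scores[q] == 0]
```

Requiring a score for every question is the point of the design: it is harder for a smooth output to win on overall impression when each judgment has to be entered separately.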
The First Version Is a Practice
A private benchmark starts as a practice.
When a new model lands, I open a folder of work I already trust: essays I've shipped, code reviews I've written, decision memos from real launches, prompts that have failed me in interesting ways. I run the new model against each one, read the outputs alongside my own answer, and write down what surprised me, what felt wrong, and which of the five questions above the model handled or dodged.
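The loop described here can be sketched in a few lines. Everything specific is an assumption for illustration: the corpus is taken to be a folder of `.txt` prompts, `generate` is a hypothetical stand-in for whatever client calls the model, and the log is an append-only JSON Lines file.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable


def run_corpus(
    corpus_dir: Path,
    model_name: str,
    generate: Callable[[str], str],  # hypothetical: prompt text in, model output out
    log_path: Path,
) -> int:
    """Run `generate` on every .txt prompt in corpus_dir.

    Appends one JSON line per task to log_path and returns the number of tasks run.
    """
    count = 0
    with log_path.open("a", encoding="utf-8") as log:
        for prompt_file in sorted(corpus_dir.glob("*.txt")):
            prompt = prompt_file.read_text(encoding="utf-8")
            record = {
                "date": date.today().isoformat(),
                "model": model_name,
                "task": prompt_file.stem,
                "output": generate(prompt),
                "notes": "",  # left blank; filled in by hand after reading
            }
            log.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Appending rather than overwriting is deliberate: the log is the memory the half-hour otherwise fails to leave behind.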
That habit is the one thing two years on Chatbot Arena and RouteLLM left me with. Chatbot Arena (now LM Arena) worked because crowdsourced pairwise preference, at scale, is a strong signal about general capability.[1] Millions of votes on a public board do an honest job of telling you which model is more useful on the median request.
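To see how pairwise votes become a ranking, here is a plain online Elo update. It is a simplification for illustration, not Chatbot Arena's actual pipeline; the `K` factor of 32, the starting rating of 1000, and the function name are assumptions.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) votes into ratings, processing them in the order cast."""
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the Elo model assigned to the winner before the vote.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected) moves the ratings more than a predictable win.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)
```

Each vote moves a fixed budget of points from loser to winner, so the result describes how a model fares across the whole stream of requests, which is the "median request" signal and nothing narrower.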
RouteLLM was the next step. Once you had preferences over real tasks, you could route a specific prompt to the model most likely to satisfy them, instead of paying the strongest model to do every job.[2] Both projects were ways of trading averages for fit, at progressively smaller scales.
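The routing idea reduces to a threshold on a predicted preference. The sketch below is not RouteLLM's API: `strong_win_prob` is a hypothetical scorer that estimates how likely the stronger model's answer is to be preferred for a given prompt, and the model names and default threshold are placeholders.

```python
from typing import Callable


def route(
    prompt: str,
    strong_win_prob: Callable[[str], float],  # hypothetical learned scorer
    threshold: float = 0.5,
    strong: str = "strong-model",
    weak: str = "weak-model",
) -> str:
    """Send the prompt to the strong model only when it is likely to matter."""
    p = strong_win_prob(prompt)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"scorer returned {p!r}, expected a probability")
    return strong if p >= threshold else weak
```

Raising the threshold sends more traffic to the cheaper model; the threshold is where the cost and quality tradeoff gets set.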
A private benchmark is the same trade at N=1. The corpus is small, the rubric is mine, and the judge is sometimes a model and sometimes me. The output is a map of where each model belongs in my work this month, written in language strict enough that I can rerun it next month and see what changed.
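Seeing what changed between two runs can be as simple as diffing two score tables. In this sketch the table layout, a dict keyed by `(model, question)`, and the `min_delta` cutoff are assumptions made for illustration.

```python
def score_changes(last_month: dict, this_month: dict, min_delta: int = 1) -> dict:
    """Return {(model, question): delta} for scores that moved by at least min_delta.

    Only keys present in both runs are compared, so new models and retired
    questions are ignored rather than reported as changes.
    """
    changes = {}
    for key in last_month.keys() & this_month.keys():
        delta = this_month[key] - last_month[key]
        if abs(delta) >= min_delta:
            changes[key] = delta
    return changes
```

A nonzero delta only says the score moved. Whether the model improved or the judging shifted still has to be settled by rereading the logged outputs.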
If the benchmark disagrees with me, that's where calibration begins. Maybe the rubric is wrong, the task set is shallow, or my intuition is out of date. Any of those is useful to find out.
What It Measures
The leaderboard isn't going away. It's going to keep getting more precise about questions I'm not asking. The work is to write down the question I am asking, in language strict enough that next month I can rerun it and tell whether the answer changed because the model got better, or because I did.
References
[1] Chatbot Arena (LMSYS / LM Arena). Crowdsourced pairwise preference evaluation for LLMs. ICML 2024.
[2] RouteLLM. Routing prompts between weaker and stronger models using preference data.
[3] Taste Was Always the Job. Prior essay on taste as the bottleneck when AI makes execution cheap.