The Evaluation Gap
The evaluation problem is now harder than the modeling problem. Here is what we believe about the state of AI evaluation and why we started this company.
Public benchmarks have a half-life
Every public benchmark follows the same arc. It launches, captures real signal, becomes the metric labs optimize against, and then saturates or gets gamed into meaninglessness. SWE-bench went from 30% to over 80% on frontier models in under a year. HumanEval-style coding tasks are now trivially solved by small models. The benchmarks that mattered eighteen months ago tell you almost nothing today.
The problem is structural. The things that get measured become the things that get targeted. Once a benchmark is public, labs can engineer against it—through prompt optimization, few-shot stuffing, or training set contamination—without any corresponding improvement in general capability. Google’s Gemini 1.0 Ultra used 32 chain-of-thought examples per MMLU topic to claim state-of-the-art. The score moved. The model didn’t.
Good evals are the bottleneck, not good models
The frontier labs have said this themselves: teams with strong evaluation suites upgrade to new models in days. Teams without them take weeks, operating reactively—waiting for user complaints, reproducing issues manually, hoping fixes don’t introduce regressions. The cost of weak evaluation isn’t visible upfront. It compounds quietly until something breaks in production.
The harder problem is that writing good evals requires a different expertise than building models. It requires understanding what behaviors to test for, how to grade outcomes without penalizing creative solutions, and how to maintain signal as capabilities improve. This is a craft, not a checklist.
Models are now smarter than their evaluations
There is a documented pattern at every frontier lab: a model finds a better solution than the eval designer anticipated, and gets marked as a failure. Anthropic’s Opus found a superior approach to a flight booking task in tau-bench—one that was genuinely better for the customer—and was penalized for it. An internal run of CORE-Bench initially scored Opus at 42% because of rigid grading. After the grading was fixed, the score was 95%.
This is the central irony. The more capable the model, the more likely it is to be mismeasured by static evaluations designed for less capable systems. Outcome-based grading—checking what happened rather than how it happened—is the only approach that scales with intelligence.
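To make the distinction concrete, here is a minimal sketch contrasting the two grading styles on a hypothetical flight-booking task. All names (`BookingState`, `grade_by_process`, `grade_by_outcome`, the action strings) are illustrative, not from tau-bench or any real framework.

```python
# Outcome-based vs. process-based grading, sketched on a toy booking task.
from dataclasses import dataclass, field

@dataclass
class BookingState:
    booked: bool = False
    total_cost: float = 0.0
    actions: list = field(default_factory=list)  # tool calls the agent made

EXPECTED_ACTIONS = ["search_flights", "select_flight", "confirm_booking"]

def grade_by_process(state: BookingState) -> bool:
    # Brittle: fails any trajectory that deviates from the scripted path,
    # even when the deviation produced a better result for the customer.
    return state.actions == EXPECTED_ACTIONS

def grade_by_outcome(state: BookingState, budget: float = 500.0) -> bool:
    # Robust: checks what happened, not how. Any trajectory ending in a
    # valid booking under budget passes.
    return state.booked and state.total_cost <= budget

# A model that ran an extra search and found a cheaper route:
creative = BookingState(
    booked=True,
    total_cost=320.0,
    actions=["search_flights", "search_flights",
             "select_flight", "confirm_booking"],
)

print(grade_by_process(creative))  # False: penalized for the better solution
print(grade_by_outcome(creative))  # True
```

The process grader encodes the designer’s anticipated trajectory, which is exactly the assumption a more capable model will violate; the outcome grader only encodes the goal, so it keeps working as models find novel paths to it.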
Intelligence and reliability are orthogonal
Smarter models are not necessarily more reliable. Independent measurement has shown there is no strong correlation between intelligence benchmarks and hallucination rates. A model can score at the top of reasoning benchmarks while confidently fabricating facts at a higher rate than a less capable competitor. These are different axes of quality, and most evaluation frameworks only measure one.
For deployment-critical applications, the question is not “how smart is this model” but “how does this model fail, how often, and in what contexts.” Public benchmarks almost never answer this.
No single evaluation layer catches everything
Safety engineering uses the Swiss Cheese Model: stack enough imperfect layers and failures that slip through one get caught by another. AI evaluation should work the same way. Code-based graders are fast and deterministic but brittle. Model-based graders handle nuance but hallucinate. Human review is the gold standard but doesn’t scale. The right answer is all of them, weighted by context.
Most teams use one or two layers at most. The failure modes they miss are exactly the ones that reach production.
External evaluation finds what internal evaluation cannot
Internal eval teams are constrained by their own assumptions. They build tests based on their mental model of how the system works, which means they systematically miss failure modes that don’t fit that model. This isn’t a talent problem. It’s an epistemological one. Red teams exist for security because insiders cannot fully attack their own work. The same logic applies to capability evaluation.
Private, external evaluation suites also stay fresh in a way that public benchmarks cannot. They never appear in training data. They cannot be optimized against. Their signal doesn’t decay.
The field is early and moving fast
Multi-agent evaluation barely exists. Long-running task evaluation is nascent. Domain-specific evaluation for legal, medical, and financial applications is mostly ad hoc. The cost of running benchmarks with statistical rigor is exploding—95% confidence intervals require many repeated runs across an expanding set of models. Meanwhile, inference costs are dropping and total AI spend is rising, which means more model behavior in more contexts with less systematic oversight.
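The cost claim can be made concrete with the standard normal approximation for a binomial proportion: the number of repeated runs needed grows with the square of the precision you want. The function name and the worst-case p = 0.5 default are illustrative.

```python
# How many runs does a 95% confidence interval of a given half-width cost?
# Uses the normal approximation n >= z^2 * p(1-p) / half_width^2, with
# p = 0.5 as the worst case for a pass-rate estimate.
import math

def runs_needed(half_width: float, p: float = 0.5, z: float = 1.96) -> int:
    return math.ceil(z**2 * p * (1 - p) / half_width**2)

for hw in (0.10, 0.05, 0.02):
    print(f"±{hw:.0%} CI: {runs_needed(hw)} runs per (model, task) pair")
```

Halving the interval width roughly quadruples the run count, and that count multiplies across every model and every task in the suite, which is why rigorous benchmarking budgets grow so much faster than the benchmarks themselves.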
The gap between what’s being deployed and what’s being measured is widening. We exist to close it.