Beliefs

What we believe about the state of AI evaluation.

The most important failures are often the rarest.

The behaviors that matter most are not always the ones that show up on broad benchmarks. The highest-cost failures are often subtle, infrequent, and easy to miss until they damage trust, create policy problems, or harm a lab's reputation.

Private evals are more useful than public benchmarks.

Frontier labs do not need more public leaderboard theater. They need private, targeted evaluations that reflect real deployment risk and preserve signal over time.

Evals should drive decisions, not just produce scores.

A good eval does more than measure model behavior. It helps teams decide whether a system is ready to ship, where it is fragile, and what needs to change before deployment.

High signal beats broad coverage.

Coverage matters, but not all coverage is equally useful. We believe targeted eval families focused on specific, high-consequence behaviors are often more valuable than generic benchmark breadth.

Credibility comes from operational closeness.

The best evals are not designed in isolation. They come from understanding how frontier labs actually work: how models are deployed, how risks are reviewed, and how decisions get made under real constraints.

Narrow, high-stakes problems are worth building for.

Some of the most valuable infrastructure companies start by solving a specific problem for a small set of demanding customers. We believe that catching rare but damaging model failures is one of those problems.