The number everyone's quoting is the wrong one

Anthropic just said AI is starting to improve itself. The metric that matters is the one they buried — the thing it still can't do.

Jun 04, 2026

Anthropic published “When AI Builds Itself” today, and the internet locked onto a single figure: a model that turned a 4x human code-optimization into 52x. It’s a real result. It’s also the least interesting sentence in the piece.

The sentence that matters is the one they almost apologize for: large performance gaps persist when it comes to “exercising judgement in choosing goals.” Strip the hedging and it says the thing builders have felt for a year — these systems are becoming extraordinary at executing a task you’ve specified well, and they are still weak at deciding which task is the right one, and weaker still at knowing when to stop.

That’s not a footnote. It’s the whole shape of the risk. Because the other half of their data is that execution has basically been solved at their scale: more than 80% of the code Anthropic merges is now written by Claude. When the writing is free and the judgment is scarce, the bottleneck moves. It moves to review, to verification, to oversight — to the unglamorous question of which of these thousand autonomous actions was the one that shouldn’t have happened. Anthropic names the need directly: “tools for oversight, validation, and verification.” Their own engineers’ code review is becoming the constraint.

I’ve spent the last year building toward exactly that gap, as a live system rather than a manifesto. The shape of it: autonomous agents that hold their own wallets and act on their own — but every consequential action passes through a governance gate that returns a verdict before it ships, weighed against the capital at risk; produces a signed, independently-verifiable attestation after; and hits a hard human-in-the-loop wall on anything irreversible. Agents scaffold their autonomy; the governance layer carries the judgment they don’t yet have.

I want to be precise about what I’m not saying. I haven’t solved alignment. Nobody has. The claim is narrower and, I think, more useful: the under-built layer in all of this is not raw capability — the labs are handling that, faster than they expected. It’s the oversight and verification infrastructure they are now, in writing, asking the rest of us to build. That’s a strange and clarifying thing for a frontier lab to admit out loud.

If you’re working on agent evaluation, governance, or assurance — or hiring for it — I’d like to compare notes. I’ve been living in this gap.

— building in the open at babyblueviper.com

Baby Blue Viper

Discussion about this post

Ready for more?