Modelwire

New math benchmark reveals AI models confidently solve problems that have no solution


SOOHAK, a new 439-task mathematics benchmark, exposes a critical blind spot in frontier AI systems: scaling compute improves problem-solving ability, but it does nothing to help models recognize when a task is fundamentally unsolvable. Google's Gemini 3 Pro achieves 30 percent on research-grade problems, yet no model exceeds 50 percent accuracy on the 99 deliberately broken tasks embedded in the benchmark. This gap between raw capability and epistemic honesty matters for deployment. It suggests that current scaling approaches may not deliver the reasoning robustness required for high-stakes applications, where false confidence is costlier than admitting uncertainty.

Modelwire context

Explainer

The benchmark's core design insight is that unsolvable problems aren't a stress test of raw math ability but a probe of metacognition: does the model know what it doesn't know? Most evaluations never ask that question, so the industry has been optimizing for answer quality while leaving answer-refusal capacity almost entirely unmeasured.

We have no prior coverage in our archive to anchor this story to. It does, however, belong to a growing body of work questioning whether capability benchmarks and reliability benchmarks measure the same thing. The distinction matters because a model that scores well on solvable problems but fabricates solutions to broken ones is not simply 'less capable'; it is miscalibrated in a way that standard leaderboard rankings obscure entirely. That miscalibration is precisely what makes deployment in legal, financial, or scientific contexts risky.

Watch whether any major lab responds to SOOHAK by publishing refusal-rate data alongside accuracy scores in their next model card. If that becomes a reporting norm within the next two release cycles, the benchmark will have done real work; if labs stay silent, the gap it exposes will persist quietly.
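
To make the idea of reporting refusal rates alongside accuracy concrete, here is a minimal sketch of how the two numbers could be computed side by side. It is not SOOHAK's actual scoring method, which the summary above does not describe; the `TaskResult` record, its fields, and the `score` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not SOOHAK's real schema."""
    solvable: bool  # False for a deliberately broken task
    refused: bool   # model stated that the task has no valid solution
    correct: bool   # answer matched the reference (meaningful for solvable tasks)


def score(results):
    """Return accuracy on solvable tasks and refusal rates on both task types."""
    solvable = [r for r in results if r.solvable]
    broken = [r for r in results if not r.solvable]

    def rate(hits, pool):
        # Guard against an empty pool so the sketch never divides by zero.
        return hits / len(pool) if pool else 0.0

    return {
        # Share of solvable tasks answered correctly.
        "accuracy": rate(sum(r.correct and not r.refused for r in solvable), solvable),
        # Share of broken tasks the model correctly declined to "solve".
        "refusal_rate": rate(sum(r.refused for r in broken), broken),
        # Share of solvable tasks the model wrongly declined.
        "false_refusal_rate": rate(sum(r.refused for r in solvable), solvable),
    }


# Toy data, invented for this example: four solvable tasks and two broken ones.
demo = [
    TaskResult(solvable=True, refused=False, correct=True),
    TaskResult(solvable=True, refused=False, correct=True),
    TaskResult(solvable=True, refused=False, correct=True),
    TaskResult(solvable=True, refused=True, correct=False),
    TaskResult(solvable=False, refused=True, correct=False),
    TaskResult(solvable=False, refused=False, correct=False),
]
report = score(demo)
```

Reporting the false-refusal rate next to the refusal rate is a design choice of this sketch: without it, a model could score perfectly on broken tasks simply by declining everything.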

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SOOHAK · Google · Gemini 3 Pro · The Decoder


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on the-decoder.com. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
