A team at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has unveiled a new neural reasoning architecture that achieves unprecedented performance on competition and graduate-level mathematics, scoring 94% on the MATH benchmark — surpassing the previous best of 86% set just four months ago.

The system, called VerifyChain, combines large language model reasoning with a formal proof verification layer. Unlike previous approaches that relied solely on scaling chain-of-thought prompting, VerifyChain introduces a novel “reason-then-verify” loop in which each intermediate step of a mathematical derivation is checked against the Lean proof assistant before the model proceeds.
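To see what “checking a step against Lean” means in practice, consider a single intermediate claim such as “every natural number n satisfies n ≤ n + 1.” The snippet below is an illustrative example of how such a step could be stated and proved in Lean 4, not an excerpt from the VerifyChain paper; the point is that Lean either accepts the step or reports an error, giving the system a binary verification signal.

```lean
-- A toy intermediate proof step. If Lean accepts this theorem,
-- the step is verified; a Lean error would mean the step is rejected
-- and the model must propose an alternative.
theorem step_ok (n : Nat) : n ≤ n + 1 := Nat.le_succ n
```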

How It Works

“The key insight is that language models are remarkably good at generating plausible proof steps, but they hallucinate at a rate that makes them unreliable for formal mathematics,” said Dr. Elena Vasquez, the project’s lead researcher, in an interview with The Meridian. “By adding a verification layer, we get the creative reasoning of a language model with the rigor of a proof assistant.”

The architecture processes problems in three phases:

  1. Problem decomposition: The language model breaks the problem into subgoals, identifying which mathematical domains are relevant.
  2. Guided reasoning: For each subgoal, the model generates candidate proof steps, which are immediately verified by a Lean 4 backend.
  3. Synthesis: Verified steps are composed into a complete solution, with the model providing natural-language explanations alongside the formal proof.
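The three phases above can be sketched as a short control loop. This is a minimal, hypothetical illustration, not VerifyChain’s actual code: the model and the Lean 4 backend are replaced with stub functions, and all names (`decompose`, `propose_step`, `lean_check`, `solve`) are invented for the sketch.

```python
# Hypothetical sketch of VerifyChain's three-phase loop.
# Model calls and the Lean 4 backend are stubbed out; names are illustrative.
from dataclasses import dataclass


@dataclass
class Step:
    claim: str
    verified: bool = False


def decompose(problem: str) -> list[str]:
    # Phase 1: break the problem into subgoals.
    # A real system would ask the language model; here we stub two subgoals.
    return [f"{problem} :: subgoal {i}" for i in (1, 2)]


def propose_step(subgoal: str) -> Step:
    # Phase 2a: the model proposes a candidate proof step for a subgoal.
    return Step(claim=f"candidate step for {subgoal}")


def lean_check(step: Step) -> bool:
    # Phase 2b: stand-in for the Lean 4 verification call.
    # Here any non-empty claim "verifies"; a real backend would
    # elaborate and type-check the step.
    return bool(step.claim)


def solve(problem: str, max_retries: int = 3) -> list[Step]:
    proof: list[Step] = []
    for subgoal in decompose(problem):        # Phase 1
        for _ in range(max_retries):          # Phase 2: reason-then-verify
            step = propose_step(subgoal)
            if lean_check(step):
                step.verified = True
                proof.append(step)
                break
        else:
            raise RuntimeError(f"no verified step found for {subgoal}")
    return proof                              # Phase 3: composed solution


if __name__ == "__main__":
    steps = solve("show n^2 >= n for natural n")
    print(len(steps), all(s.verified for s in steps))
```

The key design point the sketch captures is that unverified steps never enter the final proof: a step is appended only after the verification call succeeds, and the loop retries (up to a bound) when verification fails.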

Benchmark Results

On the MATH benchmark — a collection of 12,500 competition mathematics problems spanning algebra, geometry, number theory, and combinatorics — VerifyChain scored 94.2%, up from the previous state of the art of 86.1%. More striking was its performance on the hardest difficulty tier (Level 5), where it achieved 87%, compared to 71% by the next-best system.

The team also evaluated VerifyChain on the newly released GradMath benchmark, which contains problems drawn from graduate-level qualifying exams in pure mathematics. Here the system scored 72%, compared to an average of 68% among human graduate students who took the same exam under timed conditions.

Implications and Limitations

The researchers are careful to note that VerifyChain’s strong performance does not mean AI has “solved” mathematics. The system still struggles with problems requiring novel constructions or creative leaps that go beyond known techniques.

“This is a tool for mathematicians, not a replacement for them,” Dr. Vasquez emphasized. “The most exciting application is in proof exploration — helping researchers check whether an approach to an open problem is viable before investing weeks of effort.”

The system also requires significant computational resources: because of the overhead of formal verification, each MATH benchmark problem took an average of 45 seconds to solve, compared to 2 seconds for standard language model approaches.

The team plans to release the model weights and verification infrastructure as open source within the next month.