MathAtlas Benchmark Exposes AI's Graduate-Math Formalization Ceiling

The best AI models correctly formalize fewer than 1 in 10 graduate-level theorem statements, and that figure drops to 2.6% when the math gets deeply interconnected. MathAtlas makes that gap impossible to ignore.

UPDATED 2026-05-18 / TIME HORIZON · mid term / ID · 8396D40F

Reality 72 /100

Hype 45 /100

Impact 68 /100

Explanation

Autoformalization is the task of converting human-written mathematics — theorems, proofs, definitions — into a formal language a computer can verify. It matters because verified math is the foundation of provably correct software, cryptography, and AI reasoning systems. Until now, most benchmarks tested AI on olympiad or undergraduate problems, the equivalent of judging a surgeon by their performance on a first-aid quiz.

MathAtlas changes the difficulty setting. Researchers extracted ~52,000 mathematical objects (theorems, definitions, exercises, examples, proofs) from 103 graduate-level textbooks and built a dependency graph of ~178,000 relations showing which concepts rely on which. That dependency layer is new — no prior autoformalization benchmark included it.

The results are a reality check. Strong baseline models top out at 9.8% correctness on theorem statements and 16.7% on definitions. On MA-Hard, a 700-entity subset with the deepest dependency trees, the best model manages just 2.6%. The deeper the chain of prerequisite concepts, the more sharply performance drops.

Why does this matter today? The AI field has been quietly overselling formal reasoning capabilities by benchmarking on problems that are, frankly, too easy. MathAtlas sets a credible bar for graduate and research-level mathematics, which is where formal verification actually needs to work to be useful in practice. Any lab claiming their model "does math" now has a much harder test to pass.

Watch for whether frontier model providers (OpenAI, Google DeepMind, Anthropic) engage with this benchmark publicly — silence would itself be a signal.

Autoformalization — translating natural-language mathematics into proof-assistant languages like Lean or Isabelle — has seen accelerating investment, but benchmark coverage has been systematically skewed toward competition mathematics (MATH, miniF2F) and early undergraduate content. MathAtlas addresses the coverage gap at scale: 52k entities drawn from 103 graduate textbooks, spanning the kind of abstract algebra, topology, and analysis that sits at the frontier of mechanized proof efforts.
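For readers who have not seen a proof-assistant language, here is a toy illustration of the task in Lean 4. It sits far below graduate level and is not drawn from MathAtlas; the theorem names are made up for this sketch. The second statement shows why "it compiles" is not the same as "it is a correct formalization", which is presumably the kind of gap a correctness score has to catch.

```lean
-- Informal statement: "For all natural numbers a and b, a + b = b + a."

-- A faithful formalization in Lean 4 (core library only).
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A flawed formalization: it type-checks and is provable, but it only says
-- that a + b equals itself, so it does not capture the informal claim.
theorem add_comm_misformalized (a b : Nat) : a + b = a + b :=
  rfl
```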

The dependency graph (~178k relations) is the architectural differentiator. Prior benchmarks treat each statement as an isolated unit; MathAtlas encodes the DAG of conceptual prerequisites, enabling evaluation of whether a model can formalize a statement correctly given its full definitional context — or whether it degrades as that context deepens. The MA-Hard subset (700 entities, maximum dependency depth) operationalizes this: it's not just harder math, it's math whose correct formalization requires correctly resolving a long chain of prior formalizations.
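To make "dependency depth" concrete, here is a minimal Python sketch. It assumes depth means the longest chain of prerequisites below an entity and that a hard subset is simply the k deepest entities; the article does not say how MA-Hard is actually selected, and the toy graph and function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical toy graph: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "group": [],
    "subgroup": ["group"],
    "normal_subgroup": ["subgroup"],
    "quotient_group": ["normal_subgroup", "group"],
    "first_isomorphism_theorem": ["quotient_group"],
    "metric_space": [],
}


def dependency_depths(depends_on):
    """Return the longest prerequisite chain below each entity (0 if none).

    Assumes the graph is acyclic, as a DAG of prerequisites should be.
    """
    @lru_cache(maxsize=None)
    def depth(entity):
        prereqs = depends_on.get(entity, [])
        return 0 if not prereqs else 1 + max(depth(p) for p in prereqs)

    return {entity: depth(entity) for entity in depends_on}


def hardest_subset(depends_on, k):
    """Pick the k deepest entities, breaking ties alphabetically."""
    depths = dependency_depths(depends_on)
    return sorted(depths, key=lambda e: (-depths[e], e))[:k]


DEPTHS = dependency_depths(DEPENDS_ON)
HARDEST = hardest_subset(DEPENDS_ON, 2)
print(DEPTHS)
print(HARDEST)
```

In this toy graph the isomorphism theorem sits four steps above the bare definition of a group, so a formalizer has to get four layers right before it can state the theorem correctly.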

Baseline performance numbers are stark. 9.8% on theorem statements and 16.7% on definitions represent the ceiling for current strong models under standard evaluation. The 2.6% figure on MA-Hard suggests that dependency depth is a near-total performance killer — consistent with known failure modes in LLMs around long-range coherence and symbol grounding across extended contexts.
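A quick back-of-the-envelope check puts those percentages in absolute terms. This sketch assumes the 2.6% is measured over all 700 MA-Hard entities and that it is fair to set it against the 9.8% theorem-statement figure; the article confirms neither, so read the outputs as rough orders of magnitude.

```python
# Figures quoted in the article.
MA_HARD_SIZE = 700
BEST_MA_HARD_RATE = 0.026   # best model on MA-Hard
BEST_THEOREM_RATE = 0.098   # best model on theorem statements overall

# Assumption: the 2.6% is computed over all 700 MA-Hard entities.
approx_correct_on_ma_hard = round(MA_HARD_SIZE * BEST_MA_HARD_RATE)

# Assumption: the two rates are comparable enough to take a relative drop.
relative_drop_percent = round(
    100 * (BEST_THEOREM_RATE - BEST_MA_HARD_RATE) / BEST_THEOREM_RATE
)

print(f"about {approx_correct_on_ma_hard} of {MA_HARD_SIZE} entities correct")
print(f"about {relative_drop_percent}% relative drop versus theorem statements")
```

Under those assumptions, the best model gets roughly 18 of the 700 hardest entities right.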

Open questions the paper raises but doesn't fully resolve: What is the inter-annotator agreement on the "correctness" judgments? How sensitive are results to the choice of target formal language? Does retrieval-augmented formalization (feeding relevant dependency context explicitly) recover meaningful performance, or is the bottleneck in the model's reasoning rather than its context window?
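The retrieval-augmented idea can be sketched in a few lines of Python: walk the dependency graph, collect every prerequisite of the target, and place them ahead of the statement in the prompt. This is only an illustration of the general technique; the graph, the statement texts, the prompt layout, and the function names are all hypothetical, and no model is called.

```python
# Hypothetical toy data: prerequisites and informal statements per entity.
DEPENDS_ON = {
    "group": [],
    "subgroup": ["group"],
    "normal_subgroup": ["subgroup"],
    "quotient_group": ["normal_subgroup", "group"],
}
STATEMENTS = {
    "group": "A set with an associative operation, identity, and inverses.",
    "subgroup": "A subset of a group that is itself a group.",
    "normal_subgroup": "A subgroup invariant under conjugation.",
    "quotient_group": "The cosets of a normal subgroup form a group.",
}


def transitive_prerequisites(entity, depends_on):
    """List every prerequisite of `entity`, prerequisites first (post-order DFS)."""
    ordered, seen = [], set()

    def visit(node):
        for prereq in depends_on.get(node, []):
            if prereq not in seen:
                seen.add(prereq)      # mark before recursing to avoid repeats
                visit(prereq)
                ordered.append(prereq)

    visit(entity)
    return ordered


def build_prompt(entity, statements, depends_on):
    """Assemble a prompt that puts dependency context before the target."""
    lines = ["Prerequisites:"]
    for prereq in transitive_prerequisites(entity, depends_on):
        lines.append(f"- {prereq}: {statements[prereq]}")
    lines.append(f"Formalize: {statements[entity]}")
    return "\n".join(lines)


PREREQS = transitive_prerequisites("quotient_group", DEPENDS_ON)
PROMPT = build_prompt("quotient_group", STATEMENTS, DEPENDS_ON)
print(PROMPT)
```

If feeding context like this recovers much of the lost accuracy, the bottleneck is access to definitions; if it does not, the problem lies in the reasoning itself.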

The benchmark's release is the contribution; the experiments are illustrative rather than exhaustive. The field now has a credible high-difficulty test. The next meaningful signal will be whether any system clears 20% on MA-Hard within 12 months, up from today's 2.6%; that would indicate genuine progress rather than benchmark overfitting.

Reality meter

Source Quality 75 / 100

Community Confidence 50 / 100

Why this score?

Main claim

MathAtlas is a large-scale graduate-level autoformalization benchmark that reveals current AI models are far from capable of reliably formalizing research-level mathematics, especially when deep conceptual dependencies are involved.

Evidence

The benchmark contains ~52,000 theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks.
A mathematical dependency graph of ~178,000 relations is included — the first autoformalization benchmark to incorporate such relational structure.
Strong baseline models achieve at most 9.8% correctness on theorem statements and 16.7% on definitions.
On MA-Hard (700 entities with the deepest dependency trees), the best model achieves only 2.6% correctness.
The paper identifies that model performance degrades substantially as dependency depth increases.

Skepticism

The paper does not detail inter-annotator agreement or how 'correctness' is operationally defined, making it hard to assess whether the ceiling scores reflect model failure or evaluation noise.
Baselines are described as 'strong' without naming specific frontier models, limiting reproducibility and external comparison.
No retrieval-augmented or dependency-conditioned baselines are reported, leaving open whether the bottleneck is context access rather than reasoning capability.
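On the first point, one common way to quantify inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a generic implementation with made-up labels; nothing here indicates which measure, if any, the MathAtlas authors used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgments: 1 = "formalization is correct", 0 = "incorrect".
ANNOTATOR_A = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
ANNOTATOR_B = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
KAPPA = cohens_kappa(ANNOTATOR_A, ANNOTATOR_B)
print(round(KAPPA, 3))
```

Here the annotators agree on 8 of 10 items, yet kappa comes out near 0.52, because most of that agreement is on the common "incorrect" label. With correctness rates under 10%, raw agreement alone would overstate how reliable the judgments are.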

Score rationale

Reality 72

The benchmark is grounded in concrete extraction from 103 real textbooks with quantified performance numbers, which makes the core empirical claims credible; reproducibility is less certain, since the baselines are not named.

Hype 45

The paper is measured in its claims — it presents a benchmark and reports results without overclaiming on model capabilities or future trajectories, keeping hype low.

Impact 68

Filling the graduate-math gap in autoformalization evaluation is a meaningful infrastructure contribution, but impact depends on community adoption and whether frontier labs engage with the benchmark.

Glossary

Autoformalization: The process of automatically translating mathematical statements written in natural language into formal code that can be verified by proof-assistant software like Lean or Isabelle.
Proof-assistant languages: Specialized programming languages (such as Lean and Isabelle) designed to help mathematicians write and verify formal mathematical proofs that a computer can check for correctness.
Dependency graph: A structured representation showing how mathematical concepts depend on one another, where nodes represent definitions or theorems and edges represent prerequisite relationships.
DAG: Directed Acyclic Graph — a network structure where connections flow in one direction without loops, used here to represent the hierarchical chain of mathematical prerequisites.
Symbol grounding: The ability of a language model to correctly connect abstract symbols (like mathematical notation) to their actual meanings and relationships across long passages of text.
Retrieval-augmented formalization: A technique where relevant mathematical context and prerequisite information is explicitly provided to a model to help it formalize statements more accurately.

Sources

MathAtlas: A Benchmark for Autoformalization in the Wild (arxiv.org). Tier 1 source, trust 90/100.

Prediction

Will any AI system achieve above 20% correctness on MathAtlas's MA-Hard subset within 12 months of the benchmark's release?
