MathAtlas Benchmark Exposes AI's Graduate-Math Formalization Ceiling
The best AI models can correctly formalize fewer than 1-in-10 graduate-level theorem statements — and that number collapses to 2.6% when the math gets deeply interconnected. MathAtlas just made that embarrassing gap impossible to ignore.
Explanation
Autoformalization is the task of converting human-written mathematics — theorems, proofs, definitions — into a formal language a computer can verify. It matters because verified math is the foundation of provably correct software, cryptography, and AI reasoning systems. Until now, most benchmarks tested AI on olympiad or undergraduate problems, the equivalent of judging a surgeon by their performance on a first-aid quiz.
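To make the task concrete, here is a minimal Lean 4 sketch of what formalizing a statement looks like. It is deliberately far below graduate level and is not drawn from MathAtlas; the theorem names are made up for illustration. It also shows why "correctness" is more than "the code compiles": both declarations are accepted by Lean, but only the first says what the informal sentence says. How MathAtlas itself judges correctness is not covered here.

```lean
-- Informal statement: "For all natural numbers a and b, a + b = b + a."

-- A faithful formalization: the formal statement matches the informal one.
theorem add_comm_faithful (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An unfaithful formalization: it type-checks, but it only states
-- a + b = a + b, which is not what the informal sentence claims.
theorem add_comm_unfaithful (a b : Nat) : a + b = a + b :=
  rfl
```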
MathAtlas changes the difficulty setting. Researchers extracted ~52,000 mathematical objects (theorems, definitions, exercises, examples, proofs) from 103 graduate-level textbooks and built a dependency graph of ~178,000 relations showing which concepts rely on which. That dependency layer is new — no prior autoformalization benchmark included it.
The results are a reality check. Strong baseline models top out at 9.8% correctness on theorem statements and 16.7% on definitions. On MA-Hard, a 700-entity subset with the deepest dependency trees, the best model manages just 2.6%. The deeper the conceptual scaffolding, the more sharply performance drops.
Why does this matter today? The AI field has been quietly overselling formal reasoning capabilities by benchmarking on problems that are, frankly, too easy. MathAtlas sets a credible bar for graduate and research-level mathematics, which is where formal verification actually needs to work to be useful in practice. Any lab claiming its model "does math" now has a much harder test to pass.
Watch for whether frontier model providers (OpenAI, Google DeepMind, Anthropic) engage with this benchmark publicly — silence would itself be a signal.
Autoformalization — translating natural-language mathematics into proof-assistant languages like Lean or Isabelle — has seen accelerating investment, but benchmark coverage has been systematically skewed toward competition mathematics (MATH, miniF2F) and early undergraduate content. MathAtlas addresses the coverage gap at scale: 52k entities drawn from 103 graduate textbooks, spanning the kind of abstract algebra, topology, and analysis that sits at the frontier of mechanized proof efforts.
The dependency graph (~178k relations) is the architectural differentiator. Prior benchmarks treat each statement as an isolated unit; MathAtlas encodes the DAG of conceptual prerequisites, enabling evaluation of whether a model can formalize a statement correctly given its full definitional context — or whether it degrades as that context deepens. The MA-Hard subset (700 entities, maximum dependency depth) operationalizes this: it's not just harder math, it's math whose correct formalization requires correctly resolving a long chain of prior formalizations.
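One way to picture "dependency depth" is as the length of the longest chain of prerequisites beneath an entity. The sketch below computes that on a tiny hand-written graph and picks the deepest entities as a "hard" subset. The entity names, the `DEPENDS_ON` mapping, and the selection rule are all assumptions for illustration; the source only says MA-Hard contains the entities with the deepest dependency trees, not how depth is measured.

```python
from functools import lru_cache

# Hypothetical toy graph: each entity maps to the entities it directly depends on.
# Names are illustrative and are not taken from MathAtlas.
DEPENDS_ON: dict[str, list[str]] = {
    "set": [],
    "binary_operation": ["set"],
    "group": ["set", "binary_operation"],
    "subgroup": ["group"],
    "normal_subgroup": ["subgroup"],
    "quotient_group": ["group", "normal_subgroup"],
    "first_isomorphism_theorem": ["quotient_group"],
}


@lru_cache(maxsize=None)
def dependency_depth(entity: str) -> int:
    """Length of the longest prerequisite chain below `entity` (0 if it has none)."""
    prerequisites = DEPENDS_ON[entity]
    if not prerequisites:
        return 0
    return 1 + max(dependency_depth(p) for p in prerequisites)


def hardest_subset(k: int) -> list[str]:
    """Return the k entities with the greatest depth, deepest first (ties by name)."""
    return sorted(DEPENDS_ON, key=lambda e: (-dependency_depth(e), e))[:k]


if __name__ == "__main__":
    for name in DEPENDS_ON:
        print(name, dependency_depth(name))
    print(hardest_subset(2))
```

The recursion terminates only because the graph is acyclic; a real pipeline would presumably validate that before computing depths.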
Baseline performance numbers are stark. 9.8% on theorem statements and 16.7% on definitions represent the ceiling for current strong models under standard evaluation. The 2.6% figure on MA-Hard suggests that dependency depth is a near-total performance killer — consistent with known failure modes in LLMs around long-range coherence and symbol grounding across extended contexts.
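A claim like "performance degrades with dependency depth" can be checked by grouping per-entity judgments by depth and computing a correctness rate for each group. The following is a minimal sketch of that bookkeeping. The function name and the `(depth, judged_correct)` record format are assumptions, and the records are made up purely to exercise the function; they are not MathAtlas results.

```python
from collections import defaultdict


def correctness_by_depth(records: list[tuple[int, bool]]) -> dict[int, float]:
    """Map each dependency depth to the fraction of entities judged correct."""
    totals: dict[int, int] = defaultdict(int)
    correct: dict[int, int] = defaultdict(int)
    for depth, judged_correct in records:
        totals[depth] += 1
        correct[depth] += int(judged_correct)
    return {depth: correct[depth] / totals[depth] for depth in sorted(totals)}


# Made-up (depth, judged_correct) pairs, for illustration only.
toy_records = [
    (0, True), (0, True), (0, False), (0, True),
    (3, True), (3, False), (3, False), (3, False),
    (6, False), (6, False),
]

if __name__ == "__main__":
    for depth, rate in correctness_by_depth(toy_records).items():
        print(f"depth {depth}: {rate:.0%} correct")
```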
Open questions the paper raises but doesn't fully resolve: What is the inter-annotator agreement on the "correctness" judgments? How sensitive are results to the choice of target formal language? Does retrieval-augmented formalization (feeding relevant dependency context explicitly) recover meaningful performance, or is the bottleneck in the model's reasoning rather than its context window?
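The retrieval-augmented question can be made concrete with a small sketch: collect every transitive prerequisite of a target statement, order them so each appears after the things it depends on, and place them in the prompt ahead of the statement to formalize. Everything here is hypothetical, including the toy graph, the informal statements, the function names, and the prompt wording; the source does not report any such baseline.

```python
# Hypothetical toy data: direct prerequisites and informal statements.
DEPENDS_ON: dict[str, list[str]] = {
    "group": [],
    "subgroup": ["group"],
    "normal_subgroup": ["subgroup"],
    "quotient_group": ["group", "normal_subgroup"],
}

STATEMENTS: dict[str, str] = {
    "group": "A set with an associative binary operation, an identity element, and inverses.",
    "subgroup": "A subset of a group that is itself a group under the same operation.",
    "normal_subgroup": "A subgroup N of G with gN = Ng for every g in G.",
    "quotient_group": "For a normal subgroup N of G, the set of cosets G/N with (aN)(bN) = (ab)N.",
}


def prerequisites_in_order(target: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Transitive prerequisites of `target`, each listed after its own prerequisites."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(entity: str) -> None:
        if entity in seen:
            return
        seen.add(entity)
        for prerequisite in depends_on[entity]:
            visit(prerequisite)
        ordered.append(entity)  # appended only after everything it depends on

    for prerequisite in depends_on[target]:
        visit(prerequisite)
    return ordered


def build_prompt(target: str, depends_on: dict[str, list[str]], statements: dict[str, str]) -> str:
    """Assemble a prompt that lists dependency context before the target statement."""
    lines = ["Definitions and results you may rely on:"]
    for name in prerequisites_in_order(target, depends_on):
        lines.append(f"- {name}: {statements[name]}")
    lines.append(f"Formalize: {statements[target]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("quotient_group", DEPENDS_ON, STATEMENTS))
```

Comparing a model's correctness with and without this context would be one way to separate a context-access bottleneck from a reasoning bottleneck, though deep entities could produce prompts too long for this naive approach.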
The benchmark's release is the contribution; the experiments are illustrative rather than exhaustive. The field now has a credible test at the graduate level. The next meaningful signal will be whether any system clears 20% on MA-Hard within 12 months. That would be a large jump from today's 2.6%, provided it reflects genuine progress rather than benchmark overfitting.
Reality meter
MathAtlas is a large-scale graduate-level autoformalization benchmark that reveals current AI models are far from capable of reliably formalizing research-level mathematics, especially when deep conceptual dependencies are involved.
- The benchmark contains ~52,000 theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks.
- A mathematical dependency graph of ~178,000 relations is included — the first autoformalization benchmark to incorporate such relational structure.
- Strong baseline models achieve at most 9.8% correctness on theorem statements and 16.7% on definitions.
- On MA-Hard (700 entities with the deepest dependency trees), the best model achieves only 2.6% correctness.
- The paper identifies that model performance degrades substantially as dependency depth increases.
- The paper does not detail inter-annotator agreement or how 'correctness' is operationally defined, making it hard to assess whether the ceiling scores reflect model failure or evaluation noise.
- Baselines are described as 'strong' without naming specific frontier models, limiting reproducibility and external comparison.
- No retrieval-augmented or dependency-conditioned baselines are reported, leaving open whether the bottleneck is context access rather than reasoning capability.
The benchmark is grounded in concrete extraction from 103 real textbooks with quantified performance numbers, making the core empirical claims credible.
The paper is measured in its claims — it presents a benchmark and reports results without overclaiming on model capabilities or future trajectories, keeping hype low.
Filling the graduate-math gap in autoformalization evaluation is a meaningful infrastructure contribution, but impact depends on community adoption and whether frontier labs engage with the benchmark.
Glossary
- Autoformalization: The process of automatically translating mathematical statements written in natural language into formal code that can be verified by proof-assistant software like Lean or Isabelle.
- Proof-assistant languages: Specialized programming languages (such as Lean and Isabelle) designed to help mathematicians write and verify formal mathematical proofs that a computer can check for correctness.
- Dependency graph: A structured representation showing how mathematical concepts depend on one another, where nodes represent definitions or theorems and edges represent prerequisite relationships.
- DAG: Directed Acyclic Graph, a network structure where connections flow in one direction without loops, used here to represent the hierarchical chain of mathematical prerequisites.
- Symbol grounding: The ability of a language model to correctly connect abstract symbols (like mathematical notation) to their actual meanings and relationships across long passages of text.
- Retrieval-augmented formalization: A technique where relevant mathematical context and prerequisite information is explicitly provided to a model to help it formalize statements more accurately.