
Mathematics in the Age of AI

In five years, mathematics went from pen-and-paper proofs to AI agents that disprove published papers and contribute to Mathlib. A personal account of a transformation we're only beginning to understand.

Tags: Mathematics · AI · Lean · Formal Verification · Education

2019: The Old World

In 2019, I was a teaching assistant for a first-year bachelor mathematics course at the University of Amsterdam. The workflow was ancient: students wrote solutions by hand or typeset them in LaTeX, I read every line, marked errors, provided feedback, and assigned a grade. If a student was stuck, they came to office hours. If I was stuck grading, I re-read the textbook.

The entire chain — from learning to writing to grading — was human all the way down. This was unremarkable. It was simply how mathematics worked. It had worked this way for centuries.

2021: My Bachelor Thesis

When I wrote my bachelor thesis in 2021, GPT-3 existed but had no meaningful presence in academic work. The workflow was:

  • Writing: Every sentence was mine. I drafted, revised, restructured, and proofread the entire document. Proper references meant manually searching Google Scholar, reading abstracts, tracking down PDFs, and formatting citations by hand.
  • Code: All of it was written by me, line by line. When I hit a bug, I debugged it by reading Python documentation, searching Stack Overflow, adding print statements, and staring at the screen. There was no tool I could ask “why does this function return None here?”
  • Understanding: If I wanted to implement an algorithm, I needed to fully understand it first. There was no shortcut. I’d read the paper, work through the math on paper, then translate it into code. A small increment in my project — adding one feature, fixing one approach — routinely took days.
  • Questions: When I had a specific technical question, my options were: ask my supervisor (if available), Google it (if the question was common enough), or read the documentation (if it existed). Many questions simply went unanswered until I figured it out myself.

This was the world. It was slow, it was laborious, and everyone accepted it as normal.

2021–2025: The In-Between

Between my bachelor thesis and returning to university, I worked as a software developer. Pre-GPT-3 in practice, pre-ChatGPT entirely. Every line of code was human-written. Every bug was human-debugged. Every architectural decision was discussed in meetings between humans.

When I left this world and returned to academia in 2025, it felt like stepping through a portal.

2025: A Different Classroom

In 2025, I was again a teaching assistant — same university, same department, similar first-year course. The difference was stark.

LLMs were everywhere. Students used them to draft solutions, check their reasoning, generate LaTeX, explore alternative proof strategies, and get instant explanations of concepts they didn’t understand. The AI was, in many ways, a better private tutor than I had ever been — infinitely patient, available at 3 AM, and capable of explaining the same concept ten different ways until something clicked.

But it created a fundamental problem: I could never know how much of a homework submission was the student’s own thinking. Was this proof written by a student who understood every step? Or was it generated by an LLM and lightly edited? The work looked the same either way.

I wouldn’t discourage LLM usage — that would be fighting the tide, and besides, learning to work with AI is a genuine skill. An LLM as a study tool is powerful: it can explain why a proof attempt fails, suggest alternative approaches, and handle the mechanical parts of LaTeX formatting so students can focus on the mathematics itself.

But it meant the homework assignments, as traditionally conceived, no longer served their original purpose as a reliable measure of understanding. In the end, the real assessment happened in the written exam — where no LLM could help.

This shift happened in roughly three years. The entire incentive structure and pedagogy of university mathematics education was disrupted, and most institutions are still figuring out how to respond.

2026: Lean, Causality, and an AI That Disproves Papers

Now, in 2026, I’m doing my master thesis in mathematics at the University of Amsterdam. The topic: using Lean to formally verify proofs in causality theory.

If 2019 was the old world and 2025 was the transition, 2026 is something else entirely. Two anecdotes illustrate where we are.

An AI Finds a Counterexample to a Published Paper

A professor at the UvA had written a paper and wanted to formalize his proof in Lean. But formalization in Lean is demanding — you don’t just prove your own theorem, you need to prove (or import proofs of) every theorem your proof depends on, all the way down to the axioms.
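A toy Lean 4 sketch (not from the professor's paper; the theorem names and statements are invented for illustration) of what "proving all the way down" means in practice. You cannot simply cite a dependency — you must supply a verified proof of it, or mark the gap with `sorry`, which Lean flags:

```lean
-- A dependency from another paper: in Lean it must be proven, not cited.
-- `sorry` compiles only with a warning, making every unverified gap visible.
theorem cited_result (n : Nat) : n ≤ n * n + 1 := sorry  -- unproven dependency

-- The "main" theorem can build on it, but the whole chain is only
-- fully verified once every `sorry` below it has been replaced.
theorem main_result (n : Nat) : n ≤ n * n + 2 :=
  Nat.le_trans (cited_result n) (Nat.le_succ _)
```

It was exactly this forced excavation of the dependency chain that surfaced the error below.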

His proof relied on a theorem from another published paper. So he asked Claude Code to formalize that other paper’s result in Lean.

Claude Code came back with an unexpected response: “I think this proof is not correct. Here is a counterexample.”

The professor checked. The counterexample was valid. An AI had disproved an already published mathematics paper — not by finding a logical error in the text, but by attempting to build the formal dependency chain and discovering it couldn’t be completed because the claimed result was false.

Fortunately, the authors were able to find an alternative path and still prove the main theorem from the professor’s paper. But the implication was striking: an AI, by attempting rigorous formalization, had caught an error that human peer review had missed.

An AI Contributes to Mathlib

In my own work with Lean, I experienced something equally remarkable but in a constructive direction.

Proving theorems in Lean is fundamentally different from proving them on paper. Lean is built on intuitionistic type theory (ITT), which has different logical foundations than the classical ZFC set theory that most mathematics papers use. Definitions that are standard in ZFC need to be reformulated for ITT. Proof strategies that work on paper may not work in Lean’s type system.
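A small Lean 4 example (a generic illustration, not taken from the thesis) of the foundational difference. In Lean's intuitionistic core, the law of excluded middle is not a primitive inference rule; it is derived from choice-style axioms in the `Classical` namespace, while constructively valid statements need no axioms at all:

```lean
-- Excluded middle: available, but only via the Classical namespace,
-- which is backed by axioms rather than Lean's core inference rules.
example (p : Prop) : p ∨ ¬p := Classical.em p

-- Constructively provable statements need no classical axioms:
example (p : Prop) (h : p) : ¬¬p := fun hn => hn h
```

A paper proof that silently uses excluded middle or choice therefore needs explicit restating before it can be formalized.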

I needed some definitions and basic lemmas formalized in Lean that had never been done before in this system — they existed in the mathematical literature, formulated in ZFC, but no one had translated them into Lean’s type theory and proven the corresponding results.

I asked Claude Code to do it.

It grasped what needed to be done and executed it correctly. It formulated the definitions in ITT, stated the lemmas, and proved them — all verified by Lean’s compiler. The AI contributed genuinely new formalizations to mathematics. Definitions and theorems that had never existed in intuitionistic type theory before were now proven, verified, and ready to be added to Mathlib.

The key advantage of Lean in this workflow is that it provides immediate, rigorous feedback. The AI doesn’t need a human to check its work — Lean either accepts the proof or rejects it. This creates a tight, fast loop: attempt a proof, get feedback, adjust, try again. The AI excels at this kind of rapid iteration.
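The feedback is binary and immediate. A toy Lean 4 illustration of what "accept or reject" looks like (standard facts about `Nat`, not thesis material):

```lean
-- Accepted: n + 0 = n holds by definition of Nat addition, so `rfl` works.
example (n : Nat) : n + 0 = n := rfl

-- `example (n : Nat) : 0 + n = n := rfl` would be REJECTED:
-- 0 + n does not reduce definitionally; the compiler demands a real proof.

-- Accepted once the right lemma is supplied:
example (n : Nat) : 0 + n = n := Nat.zero_add n
```

Each rejection comes with an error message pinpointing the failing goal, which is precisely the signal an iterating agent needs.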

The Future: Autonomous Mathematical Research

These anecdotes point toward a workflow that is rapidly becoming feasible — and that may soon become the dominant mode of mathematical discovery.

The Autonomous Math Research Loop

[Diagram: an LLM research agent generates hypotheses and writes proofs. A hypothesis engine formulates conjectures; agents 1–3 make parallel proof attempts (A, B, C) and submit them to the Lean compiler, which verifies or rejects (✓ / ✗) each one. Verified results are saved to a library of Mathlib + proven theorems, which the agents read back to generate further hypotheses.]

Agents work in parallel, share results, try different strategies

The autonomous LLM math researcher of the near future:

  1. Connected to Lean. The AI writes proofs and gets instant, rigorous verification. No human needs to check the output — the compiler is the judge. If the proof type-checks, it’s correct. Period.
  2. Generates hypotheses. Based on existing theorems and the structure of the dependency graph — the DAG in which each theorem is a node resting on the results it uses — the AI formulates conjectures: new statements that might be true and would be interesting if proven.
  3. Tries, tries, and tries. The AI attempts proofs. Most will fail. That’s fine — Lean’s fast feedback loop means each failed attempt provides information. The AI adjusts its strategy and tries again.
  4. Works in parallel. Multiple agents tackle the same conjecture from different angles simultaneously. One might try induction, another a contradiction argument, another a constructive approach. They share intermediate results.
  5. Saves everything. Every successful proof is stored, verified, and made available as a building block for future proofs. When something is proven in Lean, it becomes an axiom-like building block — a trusted node in the DAG that all future work can depend on.
  6. Grows the library. The process continuously extends Mathlib, adding definitions, lemmas, and theorems. The mathematical frontier expands autonomously.

This isn’t science fiction. Every component of this pipeline exists today. Lean provides the verification. LLMs provide the proof generation. Agent frameworks provide the orchestration. What’s missing is the integration and scale — and those are engineering problems, not research problems.

The S-Curve of Mathematics Itself

[Figure: an S-curve of new theorems per year over time. A nearly flat human-mathematics stretch runs from ~300 BC (Euclid) through ~1700 and ~1950 to today (Grothendieck); a possible AI-assisted takeoff begins around 2023–2026.]

For most of history, mathematics advanced at the pace of human thought. Euclid proved his theorems with compass and straightedge. Newton invented calculus with pen and paper. Even in the 20th century, when the pace of publication accelerated dramatically, every proof was still written, checked, and published by humans.

We don’t know where we are on this curve. We might be at the very beginning of the steep part — in which case the next decade will produce more verified mathematics than the previous two millennia combined. Or the challenges of formalizing deeper mathematics might slow things down, creating a more gradual curve.

But the direction is unmistakable. And the feedback loop — AI proves theorems, which extend the library, which gives AI more tools for proving harder theorems — has no obvious ceiling.

A Prediction

I’ll make a concrete prediction: by the start of 2028, the majority of new formally verified mathematics will be produced by LLMs rather than humans.

I give this an 80% probability.

This doesn’t mean humans become irrelevant to mathematics. Humans will still formulate the interesting questions, set research directions, provide intuition about what’s worth proving, and make the creative leaps that connect distant branches of mathematics. But the mechanical work of proof construction — the laborious, detail-oriented process of building the logical chain from axioms to theorem — will increasingly be done by machines.

This has happened before in other fields. Humans still design buildings, but CAD software does the structural calculations. Humans still compose music, but DAWs handle the audio engineering. Humans will still do mathematics, but AI will handle the formal verification and much of the proof construction.

The difference is the timeline. Those other transitions took decades. This one is happening in years.

We Don’t Know Where the Plateau Is

When GPT-4 arrived, people marveled at its mathematical reasoning. When progress seemed to slow in late 2024, some concluded that LLMs had hit their mathematical ceiling. Then reasoning models and AI coding agents arrived and broke through again, demonstrating capabilities no one had predicted from the scaling curves.

Every time someone declares a plateau, a new capability emerges. This doesn’t mean there’s no plateau — every S-curve has one. But it does mean we should be humble about predicting where it is.

The field of mathematics — one of the oldest and most rigorous human intellectual traditions — has been completely transformed in five years. In 2019, AI couldn’t reliably do basic algebra. In 2026, it’s disproving published papers, contributing new formalizations to the world’s largest verified mathematical library, and approaching the ability to conduct autonomous research.

If you’d told any mathematician in 2019 that this would happen by 2026, they would not have believed you.

What will a mathematician in 2030 not believe about 2026?

Sources

  1. Lean and Mathlib. The Lean Community, Mathlib: the math library of Lean 4. leanprover-community.github.io
  2. Intuitionistic Type Theory and the Calculus of Inductive Constructions. de Moura, L. et al. (2021), “The Lean 4 Theorem Prover and Programming Language,” CADE-28.
  3. AI for Mathematics survey. Trinh, T.H. et al. (2024), “Solving olympiad geometry without human demonstrations,” Nature, 625, 476–482.
  4. AI mathematical reasoning. Collins, K. et al. (2024), “Evaluating Language Models for Mathematics through Interactions,” PNAS, 121(24).
  5. Mathlib growth statistics. As of 2026, Mathlib contains over 180,000 formally verified declarations. github.com/leanprover-community/mathlib4