Published May 29th 2026.

On LLM Math Capabilities

Dmitry Rybin

[Figure: Proof tree illustration for LLM math search]

This short note is aimed at mathematicians. I explain the importance of test-time scaling, show how to calculate that the raw token cost of resolving the Unit Distance Conjecture was <$100, and estimate the one-shot success rate of GPT-5.5-Pro on minor open math problems at 0.4%.

Scaling Reasoning for Proof Search

Modern LLMs such as GPT-5.5-Thinking and Gemini 3.1 Pro have an adjustable thinking effort: the number of tokens they generate sequentially, one by one, in the chain-of-thought before giving an answer. Higher thinking effort consistently improves performance on almost all tasks. Ten minutes of reasoning at a typical speed of 50 tokens per second generates 30 000 thinking tokens, well below the modern LLM context window of 400 000 tokens. Here is a curve from OpenAI showing how the success rate changes each time the number of reasoning tokens is doubled:

[Figure: CoT scaling]
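The token-budget arithmetic above is easy to redo for other durations. A minimal sketch, using the 50 tokens per second and 400 000-token window quoted in the text; the function and constant names are illustrative only:

```python
# Token budget of a single chain-of-thought, using the figures quoted in the text.
TOKENS_PER_SECOND = 50      # typical generation speed quoted in the text
CONTEXT_WINDOW = 400_000    # context window size quoted in the text


def thinking_tokens(minutes: int, tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Tokens generated by `minutes` of uninterrupted sequential reasoning."""
    return minutes * 60 * tokens_per_second


def minutes_to_fill_context(context_window: int = CONTEXT_WINDOW,
                            tokens_per_second: int = TOKENS_PER_SECOND) -> float:
    """Reasoning time after which the chain-of-thought alone fills the window."""
    return context_window / (tokens_per_second * 60)


ten_minutes = thinking_tokens(10)
print(ten_minutes)                      # tokens in a 10 minute run
print(ten_minutes / CONTEXT_WINDOW)     # fraction of the context window used
print(minutes_to_fill_context())        # minutes until the window is full
```

At this speed a single chain fills the window only after a bit more than two hours, so a 10 minute run leaves plenty of room.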

We can compare it to naive best-of-n scaling and observe that reasoning compute scales better than parallel sampling.

[Figure: Compared to best-of-n scaling]
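As a rough reference point for best-of-n, here is a back-of-the-envelope model. It assumes that the \(n\) samples are independent and that each succeeds with the same probability \(p\); this is an idealisation, not something the plots establish. For a target success probability \(q\):

\[
P(\text{at least one success in } n \text{ samples}) = 1 - (1-p)^n,
\qquad
n \ge \frac{\ln(1-q)}{\ln(1-p)}.
\]

With \(p = 0.004\) this gives \(n \ge 173\) for \(q = 0.5\) and \(n \ge 575\) for \(q = 0.9\). Under these assumptions naive parallel sampling needs hundreds of independent runs to become reliable.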

What about GPT-5.5-Pro and Gemini DeepThink? Their structure is not public, but the evidence points towards a simple parallel think-combine pipeline of \(\approx 5\) to \(100\) copies of the normal thinking model, combined with some refinement steps. This is the parallel scaling axis of test-time compute: we can improve the results by increasing parallel compute and summarising the best ideas found.

One can introduce more compute scaling by prompting other copies of the LLM to judge and refine results. This gives a pipeline in the style of Aletheia [1] from Google DeepMind and Rethlas [2] from PekingU/BICMR.

The final parallel scaling axis is to indicate directly in the prompt the direction that the proof should try: "solve using algebraic number theory / probability theory / arithmetic sieves". This is not yet widespread, but it has already been tried in the literature.

In total we have four scaling axes:

  1. Longer reasoning
  2. Parallel copies scaling
  3. Generator-verifier iterations scaling
  4. Research-direction prompting scaling
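The four axes compose as nested loops around a single model call. The following is a minimal sketch of that structure, not the design of any actual system: `proof_search`, `Attempt`, `toy_generate` and `toy_verify` are hypothetical names, the toy functions stand in for real LLM calls, and the parallel copies run sequentially here for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Attempt:
    direction: str
    proof: str
    score: float


def proof_search(
    problem: str,
    directions: Sequence[str],
    n_parallel: int,
    n_rounds: int,
    generate: Callable[[str, str, str], str],
    verify: Callable[[str, str], Tuple[float, str]],
) -> Tuple[Optional[Attempt], int]:
    """Return the best-scoring attempt and the number of model calls made."""
    best: Optional[Attempt] = None
    calls = 0
    for direction in directions:                 # axis 4: research direction
        for _copy in range(n_parallel):          # axis 2: parallel copies
            feedback = ""
            for _round in range(n_rounds):       # axis 3: generator-verifier rounds
                # axis 1 (longer reasoning) lives inside each model call
                proof = generate(problem, direction, feedback)
                score, feedback = verify(problem, proof)
                calls += 2
                if best is None or score > best.score:
                    best = Attempt(direction, proof, score)
    return best, calls


# Toy stand-ins for the model calls (hypothetical, for illustration only).
def toy_generate(problem: str, direction: str, feedback: str) -> str:
    suffix = " (revised)" if feedback else ""
    return f"[{direction}] sketch for {problem}{suffix}"


def toy_verify(problem: str, proof: str) -> Tuple[float, str]:
    score = 1.0 if proof.endswith("(revised)") else 0.5
    return score, "justify the key lemma"


best, calls = proof_search(
    "toy problem",
    ["algebraic number theory", "probability theory", "arithmetic sieves"],
    n_parallel=2,
    n_rounds=2,
    generate=toy_generate,
    verify=toy_verify,
)
print(best.direction, best.score, calls)
```

The call count is the product of the loop sizes times two, which is why the cost of such a pipeline grows multiplicatively in the axes.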

Assume 60 000 tokens of reasoning per call, scaled 10x in parallel, with 10 generator-verifier iterations (counted here as 20 model calls, one generation and one verification per iteration), and 10 directions. With one 60 000-token GPT-5.5-Thinking call costing \$1.80, the total API cost becomes \(\$1.80 \times 10 \times 20 \times 10 = \$3600\), whoops. OpenAI researchers, especially Noam Brown and Sebastien Bubeck, repeatedly emphasize that such inference-time compute scaling will become more important.
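The product in this estimate can be multiplied out directly. A small sketch: the constants are the factors from the paragraph above, the reading of the factor 20 as model calls per generator-verifier chain is an assumption, and the price per million tokens is only what \$1.80 per 60 000 tokens implies, not a quoted price.

```python
# Factors from the estimate in the text.
COST_PER_CALL_USD = 1.80    # one 60 000-token reasoning call
TOKENS_PER_CALL = 60_000
PARALLEL_COPIES = 10
CALLS_PER_CHAIN = 20        # assumption: 10 iterations x (generate + verify)
DIRECTIONS = 10


def total_calls(copies: int, calls_per_chain: int, directions: int) -> int:
    """Number of model calls across all scaling axes."""
    return copies * calls_per_chain * directions


def total_cost_usd(cost_per_call: float, n_calls: int) -> float:
    """Total API cost if every call costs the same."""
    return cost_per_call * n_calls


n_calls = total_calls(PARALLEL_COPIES, CALLS_PER_CHAIN, DIRECTIONS)
cost = total_cost_usd(COST_PER_CALL_USD, n_calls)
implied_usd_per_million = COST_PER_CALL_USD / TOKENS_PER_CALL * 1_000_000

print(n_calls)                             # total model calls
print(round(cost, 2))                      # total cost in USD
print(round(implied_usd_per_million, 2))   # implied price per 1M tokens
```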

As an exercise, let's calculate the raw token cost of the Unit Distance Conjecture solution. OpenAI reports 120 pages of summarized chain-of-thought [3]. Let's assume a summarization factor of 20, so the full chain-of-thought would be 2400 pages. This is at most 1-2M tokens. The most expensive raw model inference cost I know of is $50 per 1M tokens, which gives an upper bound of $100 total. If you instead use the GPT-5.5-Pro API price of $270 per 1M tokens, you get $540 total.
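The same estimate as a short script, so the assumptions can be varied. All numbers are the ones stated above; the tokens-per-page figure is only what the 2M-token upper bound implies for 2400 pages, not a measured value.

```python
SUMMARY_PAGES = 120           # pages of summarized chain-of-thought reported
SUMMARIZATION_FACTOR = 20     # assumed ratio of full to summarized length
TOKEN_UPPER_BOUND = 2_000_000 # upper end of the 1-2M token estimate


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Raw inference cost of generating `tokens` tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


full_pages = SUMMARY_PAGES * SUMMARIZATION_FACTOR
implied_tokens_per_page = TOKEN_UPPER_BOUND / full_pages

raw_cost = cost_usd(TOKEN_UPPER_BOUND, 50)     # most expensive raw inference price
pro_cost = cost_usd(TOKEN_UPPER_BOUND, 270)    # GPT-5.5-Pro API price

print(full_pages)                       # pages of full chain-of-thought
print(round(implied_tokens_per_page))   # tokens per page implied by the bound
print(raw_cost, pro_cost)               # cost in USD under each price
```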

Are LLMs much better at combinatorics?

I do not think current LLMs are fundamentally much better at combinatorics than at the rest of mathematics. We have seen impressive LLM-assisted results in analytic number theory, algebraic number theory, probability, and optimization as well.

From what I know about how DeepSeek and Kimi train math models, there is no major intentional bias toward combinatorics. The only small difference is data generation: I know how to create unlimited difficult labeled combinatorics data automatically using MILP solvers. This is convenient, but I do not think it changes the picture by an order of magnitude.

The difference probably comes from the Erdős problems database receiving an order of magnitude more attention.

Other remarks

  1. "Are LLMs limited to human proof strategies?" - no, the RL training process allows them to go beyond human proofs.
  2. "Do LLMs come up with novel ideas and definitions?" - there are minor signs of that happening, but no impressive examples yet.
  3. "Can LLMs solve 0.4% of open problems, or solve any problem with a 0.4% success rate?" - mostly the former, but we need more experiments to be sure.
  4. "Can LLMs learn research taste?" - Noam Brown (OpenAI) and Thang Luong (Google DeepMind) think so. I trust them, and I also see many indications that this is mostly a problem of designing the right information signal.
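The two readings in remark 3 give the same one-shot rate but diverge under repeated attempts, which is what further experiments could test. A minimal sketch under a hypothetical two-parameter model: a fraction `solvable_fraction` of problems is solved with probability `per_attempt_rate` on each independent attempt, and the remaining problems are never solved. The model, the parameter names and the independence of attempts are all assumptions made for illustration.

```python
def solved_fraction(solvable_fraction: float,
                    per_attempt_rate: float,
                    attempts: int) -> float:
    """Expected fraction of problems solved at least once in `attempts` tries."""
    return solvable_fraction * (1.0 - (1.0 - per_attempt_rate) ** attempts)


# Both settings reproduce a 0.4% one-shot success rate.
FORMER = {"solvable_fraction": 0.004, "per_attempt_rate": 1.0}   # solves 0.4% of problems
LATTER = {"solvable_fraction": 1.0, "per_attempt_rate": 0.004}   # any problem, 0.4% chance

for attempts in (1, 10, 100, 1000):
    former = solved_fraction(attempts=attempts, **FORMER)
    latter = solved_fraction(attempts=attempts, **LATTER)
    print(f"{attempts:5d} attempts: former {former:.4f}, latter {latter:.4f}")
```

Under the first reading the solved fraction stays flat at 0.4% however many attempts are made; under the second it keeps growing with the number of attempts. Intermediate mixtures fall between the two.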

Notes

  1. Tony Feng et al., Towards Autonomous Mathematics Research, arXiv, 2026. This is the Aletheia paper.
  2. Haocheng Ju et al., Automated Conjecture Resolution with Formal Verification, arXiv, 2026. This introduces Rethlas and Archon.
  3. OpenAI, An OpenAI model has disproved a central conjecture in discrete geometry, May 20, 2026.
  4. Terence Tao et al., AI contributions to Erdős problems, GitHub wiki.

If you want to cite this note

@misc{Rybin2026LLMMathSearch,
  author = {Rybin, Dmitry},
  title = {On LLM Math Capabilities},
  year = {2026},
  howpublished = {\url{https://rybindmitry.github.io/blogs/understanding-llm-math-capabilities-via-search.html}},
  note = {Blog post}
}