MATH

Progress Over Time

Interactive timeline showing model performance evolution on MATH, with the state-of-the-art frontier and separate series for open and proprietary models.

MATH Leaderboard

71 models
Rank  Organization               Params  Context  Cost
1     OpenAI                     -       -        -
2     OpenAI                     -       -        -
3     -                          14B     -        -
3     Mistral AI                 675B    -        -
5     -                          -       -        -
6     Moonshot AI                1.0T    -        -
7     -                          27B     -        -
8     -                          8B      -        -
9     -                          -       -        -
10    -                          -       -        -
11    -                          1.0T    1.0M     $0.43 / $0.87
12    -                          -       -        -
13    OpenAI                     -       -        -
14    -                          12B     -        -
15    Alibaba Cloud / Qwen Team  33B     -        -
15    Alibaba Cloud / Qwen Team  73B     -        -
17    -                          3B      -        -
18    Alibaba Cloud / Qwen Team  34B     -        -
19    Microsoft                  15B     -        -
20    Alibaba Cloud / Qwen Team  15B     -        -
21    -                          -       -        -
22    -                          -       -        -
23    -                          70B     -        -
24    Amazon                     -       -        -
24    OpenAI                     -       128K     $2.50 / $10.00
26    -                          -       -        -
27    -                          4B      -        -
28    Alibaba Cloud / Qwen Team  8B      -        -
29    -                          236B    -        -
30    -                          405B    -        -
31    Amazon                     -       -        -
32    -                          -       -        -
33    -                          -       128K     $10.00 / $30.00
34    Alibaba Cloud / Qwen Team  235B    -        -
35    Alibaba Cloud / Qwen Team  7B      -        -
36    -                          -       -        -
37    -                          24B     -        -
38    -                          -       -        -
38    Moonshot AI                1.0T    -        -
40    -                          24B     -        -
41    -                          -       -        -
42    -                          24B     -        -
42    -                          -       -        -
44    -                          90B     -        -
45    Microsoft                  4B      -        -
46    -                          400B    -        -
47    Anthropic                  -       -        -
48    Alibaba Cloud / Qwen Team  72B     -        -
49    -                          60B     -        -
50    -                          8B      -        -

Showing 1–50 of 71 (page 1 of 2)
About this benchmark

What is MATH?

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is assigned a difficulty level from 1 to 5. The problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

MATH is a text benchmark that evaluates models on math and reasoning tasks. LLM Stats tracks 71 models on this benchmark, scored on a 0–1 scale. The current average is 0.671, and the leader scores 0.979.
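To make the 0–1 scale concrete, here is a minimal sketch of how an accuracy score of this kind could be computed. It assumes the convention used in the MATH dataset, where the final answer is wrapped in `\boxed{...}`, and it compares answers by exact match after stripping whitespace. The function names and the whitespace-only normalization are illustrative assumptions; real evaluation harnesses normalize answers more thoroughly, and this is not the scoring code behind the leaderboard.

```python
from typing import List, Optional


def last_boxed_answer(solution: str) -> Optional[str]:
    """Return the contents of the last boxed answer in a solution, or None."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of the marker
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Simplified normalization (an assumption): drop all whitespace."""
    return "".join(answer.split())


def accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose boxed answer matches the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty lists")
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = last_boxed_answer(pred), last_boxed_answer(ref)
        # A prediction without a boxed answer counts as wrong.
        if p is not None and r is not None and normalize(p) == normalize(r):
            correct += 1
    return correct / len(references)
```

Under this sketch, a model that answers 979 of 1,000 problems correctly would score 0.979, which the page also displays as 97.9%.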


Current leaders

o3-mini from OpenAI currently leads the MATH leaderboard with a score of 0.979 across 71 evaluated AI models.

1. o3-mini (OpenAI): 97.9%
2. o1 (OpenAI): 96.4%

Source paper

Title: Measuring Mathematical Problem Solving With the MATH Dataset
Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, and 4 others

Abstract

Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.

FAQ

Common questions about the MATH benchmark and leaderboard.

What is the MATH benchmark?

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution and is assigned a difficulty level from 1 to 5. The problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

What is the MATH leaderboard?

The MATH leaderboard ranks 71 AI models based on their performance on this benchmark. Currently, o3-mini by OpenAI leads with a score of 0.979. The average score across all models is 0.671.
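The leaderboard's rank numbers sometimes repeat and then skip (for example 3, 3, 5), which is consistent with standard competition ranking, where tied scores share a rank. The sketch below assumes that scheme; it is not taken from the site's implementation. The three named scores come from this page, while the two `hypothetical-model-*` entries and their scores are invented purely to illustrate a tie.

```python
from statistics import mean


def competition_rank(scores: dict[str, float]) -> list[tuple[int, str, float]]:
    """Rank models by score, highest first; tied scores share a rank."""
    # Sort by descending score, then by name so ties are ordered stably.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    ranked: list[tuple[int, str, float]] = []
    for position, (name, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # same score as previous row: reuse its rank
        else:
            rank = position       # otherwise the rank is the 1-based position
        ranked.append((rank, name, score))
    return ranked


scores = {
    "o3-mini": 0.979,
    "o1": 0.964,
    "Ministral 3 (14B Instruct 2512)": 0.904,
    "hypothetical-model-a": 0.904,  # invented entry to show a tie
    "hypothetical-model-b": 0.900,  # invented entry
}

leaderboard = competition_rank(scores)
for rank, name, score in leaderboard:
    print(f"{rank:>2}  {name:<34} {score:.3f}")

# Average over this small sample only, not the site's 71-model average.
print(f"average: {mean(scores.values()):.3f}")
```

With these inputs the ranks come out as 1, 2, 3, 3, 5: the two models at 0.904 share rank 3, and rank 4 is skipped.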

What is the highest MATH score?

The highest MATH score is 0.979, achieved by o3-mini from OpenAI.

How many models are evaluated on MATH?

71 models have been evaluated on the MATH benchmark, with 0 verified results and 69 self-reported results.

Where can I find the MATH paper?

The MATH paper is available at https://arxiv.org/abs/2103.03874. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does MATH cover?

MATH is categorized under math and reasoning. The benchmark evaluates text models.

What is the best open-source model on MATH?

Ministral 3 (14B Instruct 2512) by Mistral AI is the top-ranked open-source model on MATH, with a score of 0.904 (rank #3).

How recent are the MATH leaderboard results?

The MATH leaderboard was last updated in July 2026 and currently includes 71 evaluated models.