MATH
Progress Over Time
Interactive timeline showing model performance evolution on MATH
MATH Leaderboard
| Rank | Organization | Parameters | Context | Cost | License | |
|---|---|---|---|---|---|---|
| 1 | OpenAI | — | — | — | ||
| 2 | OpenAI | — | — | — | ||
| 3 | Mistral AI | 14B | — | — | ||
| 3 | Mistral AI | 675B | — | — | ||
| 5 | Google | — | — | — | ||
| 6 | Moonshot AI | 1.0T | — | — | ||
| 7 | Google | 27B | — | — | ||
| 8 | Mistral AI | 8B | — | — | ||
| 9 | Google | — | — | — | ||
| 10 | Google | — | — | — | ||
| 11 | Xiaomi | 1.0T | 1.0M | $0.43 / $0.87 | ||
| 12 | OpenAI | — | — | — | ||
| 13 | OpenAI | — | — | — | ||
| 14 | Google | 12B | — | — | ||
| 15 | Alibaba Cloud / Qwen Team | 33B | — | — | ||
| 15 | Alibaba Cloud / Qwen Team | 73B | — | — | ||
| 17 | Mistral AI | 3B | — | — | ||
| 18 | Alibaba Cloud / Qwen Team | 34B | — | — | ||
| 19 | Microsoft | 15B | — | — | ||
| 20 | Alibaba Cloud / Qwen Team | 15B | — | — | ||
| 21 | Anthropic | — | — | — | ||
| 22 | Google | — | — | — | ||
| 23 | | 70B | — | — | | |
| 24 | Amazon | — | — | — | ||
| 24 | OpenAI | — | 128K | $2.50 / $10.00 | ||
| 26 | xAI | — | — | — | ||
| 27 | Google | 4B | — | — | ||
| 28 | Alibaba Cloud / Qwen Team | 8B | — | — | ||
| 29 | DeepSeek | 236B | — | — | ||
| 30 | | 405B | — | — | | |
| 31 | Amazon | — | — | — | ||
| 32 | xAI | — | — | — | ||
| 33 | OpenAI | — | 128K | $10.00 / $30.00 | ||
| 34 | Alibaba Cloud / Qwen Team | 235B | — | — | ||
| 35 | Alibaba Cloud / Qwen Team | 7B | — | — | ||
| 36 | Anthropic | — | — | — | ||
| 37 | Mistral AI | 24B | — | — | ||
| 38 | OpenAI | — | — | — | ||
| 38 | Moonshot AI | 1.0T | — | — | ||
| 40 | Mistral AI | 24B | — | — | ||
| 41 | Anthropic | — | — | — | ||
| 42 | Mistral AI | 24B | — | — | ||
| 42 | Amazon | — | — | — | ||
| 44 | | 90B | — | — | | |
| 45 | Microsoft | 4B | — | — | ||
| 46 | Meta | 400B | — | — | ||
| 47 | Anthropic | — | — | — | ||
| 48 | Alibaba Cloud / Qwen Team | 72B | — | — | ||
| 49 | Microsoft | 60B | — | — | ||
| 50 | Google | 8B | — | — |
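The rank column repeats some positions and then skips the next one (3, 3, 5 or 15, 15, 17), which is the pattern produced by standard competition ranking. The sketch below shows that scheme under the assumption that tied ranks correspond to equal scores; the scores themselves are not shown in the table, so the input values and the function name `competition_ranks` are hypothetical.

```python
def competition_ranks(scores: list[float]) -> list[int]:
    """Assign 1-based ranks to scores already sorted in descending order.

    Equal scores share a rank, and the following rank is skipped
    (standard competition ranking: 1, 2, 3, 3, 5).
    """
    ranks: list[int] = []
    for position, score in enumerate(scores):
        if position > 0 and score == scores[position - 1]:
            # Tie with the previous entry: reuse its rank.
            ranks.append(ranks[-1])
        else:
            # Otherwise the rank is the 1-based position in the list.
            ranks.append(position + 1)
    return ranks


# Toy scores, not taken from the leaderboard.
toy_scores = [0.9, 0.8, 0.7, 0.7, 0.6]
toy_ranks = competition_ranks(toy_scores)
```

With these toy scores, the third and fourth entries share rank 3 and the last entry receives rank 5, matching the gaps visible in the table.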
What is MATH?
The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution, is assigned a difficulty level from 1 to 5, and belongs to one of seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
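As a rough illustration of that structure, a single problem could be modeled as a small validated record. This is a minimal sketch: the class name `MathProblem` and its field names are assumptions made for illustration and are not the dataset's actual schema, and the example problem is a toy placeholder, not an item from MATH.

```python
from dataclasses import dataclass

# The seven subjects and the 1-5 difficulty range come from the description above.
SUBJECTS = (
    "Prealgebra",
    "Algebra",
    "Number Theory",
    "Counting and Probability",
    "Geometry",
    "Intermediate Algebra",
    "Precalculus",
)


@dataclass(frozen=True)
class MathProblem:
    """Hypothetical in-memory record; field names are illustrative assumptions."""

    problem: str   # problem statement text
    solution: str  # full step-by-step solution text
    subject: str   # one of SUBJECTS
    level: int     # difficulty level, 1 through 5

    def __post_init__(self) -> None:
        # Reject records that fall outside the documented subjects and levels.
        if self.subject not in SUBJECTS:
            raise ValueError(f"unknown subject: {self.subject!r}")
        if not 1 <= self.level <= 5:
            raise ValueError(f"level must be in 1..5, got {self.level}")


# Toy placeholder record, not taken from the dataset.
example = MathProblem(
    problem="What is 1 + 1?",
    solution="Adding the two numbers gives 1 + 1 = 2.",
    subject="Prealgebra",
    level=1,
)
```

Validating on construction keeps malformed rows (an unknown subject or an out-of-range level) from silently entering per-subject or per-level breakdowns.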
MATH is a text benchmark that evaluates models on math and reasoning tasks. LLM Stats tracks 71 models on this benchmark, scored on a 0–1 scale. The current average is about 0.7, and the top score is 0.979.
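To make the 0–1 scale concrete, the sketch below treats a model's score as the fraction of problems answered correctly and then summarizes a set of scores by their mean and leader. This page does not state how scores are computed, so the exact-match comparison is an assumption, and the function names, model names, and numbers are all hypothetical.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answer (0 to 1).

    Assumption: answers are compared as whitespace-stripped strings.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def leaderboard_summary(scores: dict[str, float]) -> tuple[float, str, float]:
    """Return (mean score, leading model name, leading score)."""
    if not scores:
        raise ValueError("at least one model score is required")
    mean = sum(scores.values()) / len(scores)
    leader = max(scores, key=scores.get)
    return mean, leader, scores[leader]


# Toy data: three of the four answers match the references.
toy_accuracy = accuracy(["2", "5", "7", "10"], ["2", "5", "8", "10"])

# Hypothetical model names and scores, not leaderboard values.
toy_scores = {"model-a": toy_accuracy, "model-b": 0.5, "model-c": 1.0}
toy_mean, toy_leader, toy_best = leaderboard_summary(toy_scores)
```

A rounded display of such values explains how a leaderboard average can appear as a single-decimal figure while individual scores are reported to three decimals.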
Current leaders
o3-mini from OpenAI currently leads the MATH leaderboard with a score of 0.979, the highest among the 71 evaluated models.
Source paper
- Title: Measuring Mathematical Problem Solving With the MATH Dataset
- Authors: Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, and 4 others
- Published: arXiv:2103.03874
Abstract
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics. Even though we are able to increase accuracy on MATH, our results show that accuracy remains relatively low, even with enormous Transformer models. Moreover, we find that simply increasing budgets and model parameter counts will be impractical for achieving strong mathematical reasoning if scaling trends continue. While scaling Transformers is automatically solving most other text-based tasks, scaling is not currently solving MATH. To have more traction on mathematical problem solving we will likely need new algorithmic advancements from the broader research community.