Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Authors: Jingxuan Fan, Sarah Martinson, Erik Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael Brenner
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate both open- and closed-source LLMs on HARDMATH-MINI, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. |
| Researcher Affiliation | Academia | Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner. School of Engineering and Applied Sciences, Harvard University |
| Pseudocode | No | The paper describes algorithms to automatically generate problems and their step-by-step solutions and presents a flowchart for the data generation procedure (Fig. 2), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Dataset: https://github.com/sarahmart/HARDMath |
| Open Datasets | Yes | To address this gap, we introduce HARDMATH, a dataset specifically designed to focus on asymptotic reasoning in mathematics. This dataset captures a fundamentally different type of mathematical reasoning compared to other benchmarks and can be useful for evaluating LLMs' abilities to make research-relevant approximations. |
| Dataset Splits | Yes | The main HARDMATH dataset, which can be used for model developments (e.g. novel prompting techniques or fine-tuning), contains 1,060 problems, and the evaluation dataset HARDMATH-MINI, which we use in this paper to benchmark LLM performance, contains 366 problems. |
| Hardware Specification | Yes | Evaluations of open-source models on HARDMATH are conducted on a high-performance compute cluster with a single Tesla V100 GPU (16GB vram). Evaluation on one problem type typically takes less than 1 hour. |
| Software Dependencies | No | Code for data generation uses SymPy (Meurer et al., 2017), a library for symbolic mathematics, and SciPy, a library for scientific computing (Virtanen et al., 2020), to implement the mathematical procedures required for obtaining approximate, analytical solutions. However, specific version numbers for these libraries are not provided. |
| Experiment Setup | Yes | We compare the performance of several closed- and open-source models on HARDMATH in zero- and few-shot settings with the Chain-of-Thought (CoT) (Wei et al., 2023) prompting. We provide the prompts and hyper-parameters for LLMs evaluations in Appendix A.3.4 Table 7. |
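
The Software Dependencies row is marked "No" because the paper names SymPy and SciPy without giving version numbers. As a minimal sketch of how that gap is commonly closed, the snippet below records the versions installed in the current environment as pinned requirement lines. It is not taken from the HARDMath repository: the helper names `installed_versions` and `as_requirements` are hypothetical, and the versions it prints are whatever happens to be installed locally, not the versions the authors used.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            found[name] = None
    return found


def as_requirements(versions):
    """Format installed packages as pinned 'name==version' lines."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]


if __name__ == "__main__":
    # Assumption: the package list mirrors the libraries named in the table.
    # The versions actually used by the authors are not stated in the paper.
    for line in as_requirements(installed_versions(["sympy", "scipy"])):
        print(line)
```

Redirecting the output to a requirements file alongside the code would let a reader recreate the same library versions later.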