Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Authors: Jingxuan Fan, Sarah Martinson, Erik Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael Brenner

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate both open- and closed-source LLMs on HARDMATH-MINI, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets.
Researcher Affiliation	Academia	Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner. School of Engineering and Applied Sciences, Harvard University
Pseudocode	No	The paper describes algorithms to automatically generate problems and their step-by-step solutions and presents a flowchart for the data generation procedure (Fig. 2), but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Dataset: https://github.com/sarahmart/HARDMath
Open Datasets	Yes	To address this gap, we introduce HARDMATH, a dataset specifically designed to focus on asymptotic reasoning in mathematics. This dataset captures a fundamentally different type of mathematical reasoning compared to other benchmarks and can be useful for evaluating LLMs' abilities to make research-relevant approximations.
Dataset Splits	Yes	The main HARDMATH dataset, which can be used for model developments (e.g. novel prompting techniques or fine-tuning), contains 1,060 problems, and the evaluation dataset HARDMATH-MINI, which we use in this paper to benchmark LLM performance, contains 366 problems.
Hardware Specification	Yes	Evaluations of open-source models on HARDMATH are conducted on a high-performance compute cluster with a single Tesla V100 GPU (16GB vram). Evaluation on one problem type typically takes less than 1 hour.
Software Dependencies	No	Code for data generation uses SymPy (Meurer et al., 2017), a library for symbolic mathematics, and SciPy, a library for scientific computing (Virtanen et al., 2020), to implement the mathematical procedures required for obtaining approximate, analytical solutions. However, specific version numbers for these libraries are not provided.
Experiment Setup	Yes	We compare the performance of several closed- and open-source models on HARDMATH in zero- and few-shot settings with the Chain-of-Thought (CoT) (Wei et al., 2023) prompting. We provide the prompts and hyper-parameters for LLMs evaluations in Appendix A.3.4 Table 7.
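
The Software Dependencies row above is marked "No" because library versions are not stated. As an illustration only, and not something taken from the paper or its repository, the sketch below shows one way the installed versions of SymPy and SciPy could be recorded using Python's standard library. The function names `installed_versions` and `as_requirements` are hypothetical, and the distribution names `sympy` and `scipy` are assumed to be the ones installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


def as_requirements(versions):
    """Format only the installed packages as pinned 'name==version' lines."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]


if __name__ == "__main__":
    # "sympy" and "scipy" are assumed distribution names, not taken from the paper.
    for line in as_requirements(installed_versions(["sympy", "scipy"])):
        print(line)
```

Saving the printed lines alongside the code would give readers the exact versions used, which is the information this row checks for.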