Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. |
| Researcher Affiliation | Collaboration | 1 University of Cambridge, 2 Southern University of Science and Technology, 3 Hong Kong University of Science and Technology, 4 Huawei Noah's Ark Lab, 5 The Alan Turing Institute, 6 Max Planck Institute for Intelligent Systems, Tübingen |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use. Project page: meta-math.github.io |
| Open Datasets | Yes | Datasets. We use two popular mathematical reasoning benchmarks: (i) GSM8K [13] is a dataset consisting of high-quality grade school math problems, containing 7,473 training samples and 1,319 testing samples; and (ii) MATH [23] dataset ... It contains 7,500 and 5,000 samples for training and testing, respectively. |
| Dataset Splits | No | GSM8K [13] is a dataset consisting of high-quality grade school math problems, containing 7,473 training samples and 1,319 testing samples; and (ii) MATH [23] dataset ... It contains 7,500 and 5,000 samples for training and testing, respectively. The paper specifies training and testing samples but does not explicitly mention a validation split. |
| Hardware Specification | Yes | We use 8 NVIDIA A100 GPUs to train the 7B and 13B models |
| Software Dependencies | No | The paper mentions software components like GPT-3.5-Turbo and AdamW optimizer but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For the fully fine-tuning setting, we use the AdamW optimizer to train the model with 3 epochs and the batch size is 128. We use 8 NVIDIA A100 GPUs to train the 7B and 13B models, the learning rate is set as 2e-5 with a 3% learning rate warmup. For the 70B model QLoRA fine-tuning, the LoRA rank and alpha are 96 and 16, with a 0.05 dropout between the two matrices. |
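
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' training code: the class names, field names, and the `schedule_lengths` helper are hypothetical, and the step arithmetic assumes one optimizer step per global batch with the last partial batch kept. The per-GPU batch size assumes the batch of 128 is split evenly across the 8 GPUs with no gradient accumulation, which the quoted text does not state. The `alpha / rank` scaling follows the usual LoRA convention and is likewise an assumption about this setup.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FullFineTuneConfig:
    """Values quoted in the Experiment Setup row (7B and 13B models)."""
    optimizer: str = "AdamW"
    epochs: int = 3
    global_batch_size: int = 128
    learning_rate: float = 2e-5
    warmup_ratio: float = 0.03  # "3% learning rate warmup"
    num_gpus: int = 8           # NVIDIA A100

    @property
    def per_gpu_batch_size(self) -> int:
        # Assumption: even split across GPUs, no gradient accumulation.
        return self.global_batch_size // self.num_gpus


@dataclass(frozen=True)
class QLoRAConfig:
    """Values quoted for the 70B QLoRA fine-tuning run."""
    lora_rank: int = 96
    lora_alpha: int = 16
    lora_dropout: float = 0.05

    @property
    def scaling(self) -> float:
        # Assumption: the conventional LoRA scaling factor alpha / rank.
        return self.lora_alpha / self.lora_rank


def schedule_lengths(num_train_samples: int, cfg: FullFineTuneConfig) -> Tuple[int, int]:
    """Return (total_steps, warmup_steps) for a given training-set size.

    Assumption: one optimizer step per global batch; warmup steps are the
    warmup ratio times the total step count, rounded to the nearest integer.
    """
    steps_per_epoch = math.ceil(num_train_samples / cfg.global_batch_size)
    total_steps = steps_per_epoch * cfg.epochs
    warmup_steps = round(total_steps * cfg.warmup_ratio)
    return total_steps, warmup_steps
```

For example, `schedule_lengths(12_800, FullFineTuneConfig())` gives 100 steps per epoch, hence 300 total steps and 9 warmup steps; the sample count here is an illustrative figure, not the size of any dataset named above.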