MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin.
Researcher Affiliation | Collaboration | 1University of Cambridge, 2Southern University of Science and Technology, 3Hong Kong University of Science and Technology, 4Huawei Noah's Ark Lab, 5The Alan Turing Institute, 6Max Planck Institute for Intelligent Systems, Tübingen
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use. Project page: meta-math.github.io
Open Datasets | Yes | Datasets. We use two popular mathematical reasoning benchmarks: (i) GSM8K [13] is a dataset consisting of high-quality grade school math problems, containing 7,473 training samples and 1,319 testing samples; and (ii) the MATH [23] dataset ... It contains 7,500 and 5,000 samples for training and testing, respectively.
Dataset Splits | No | The paper specifies training and testing sample counts for GSM8K (7,473/1,319) and MATH (7,500/5,000) but does not explicitly mention a validation split.
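Because the paper reports only train/test sizes, a reproduction that wants a validation set must carve one out of the training data itself. A minimal sketch of that choice (the helper name and 5% fraction are our assumptions, not part of the original setup):

```python
import random

def make_validation_split(train_examples, val_fraction=0.05, seed=0):
    """Carve a held-out validation set from the training data.

    The paper gives only train/test sizes (GSM8K: 7,473/1,319;
    MATH: 7,500/5,000), so this split is a reproduction choice.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    indices = list(range(len(train_examples)))
    rng.shuffle(indices)
    n_val = max(1, int(len(indices) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(train_examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(train_examples) if i in val_idx]
    return train, val

# Example with GSM8K-sized training data (7,473 samples)
train, val = make_validation_split(list(range(7473)), val_fraction=0.05)
print(len(train), len(val))  # 7100 373
```

Any such split shrinks the effective training set relative to the paper's reported numbers, which is worth noting when comparing reproduced scores.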
Hardware Specification | Yes | We use 8 NVIDIA A100 GPUs to train the 7B and 13B models.
Software Dependencies | No | The paper mentions software components like GPT-3.5-Turbo and the AdamW optimizer but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For the full fine-tuning setting, we use the AdamW optimizer to train the model for 3 epochs with a batch size of 128. We use 8 NVIDIA A100 GPUs to train the 7B and 13B models; the learning rate is set to 2e-5 with a 3% learning rate warmup. For the 70B model QLoRA fine-tuning, the LoRA rank and alpha are 96 and 16, with a 0.05 dropout between the two matrices.