MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
Authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. |
| Researcher Affiliation | Collaboration | 1University of Cambridge 2Southern University of Science and Technology 3Hong Kong University of Science and Technology 4Huawei Noah's Ark Lab 5The Alan Turing Institute 6Max Planck Institute for Intelligent Systems, Tübingen |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use. Project page: meta-math.github.io |
| Open Datasets | Yes | Datasets. We use two popular mathematical reasoning benchmarks: (i) GSM8K [13] is a dataset consisting of high-quality grade school math problems, containing 7,473 training samples and 1,319 testing samples; and (ii) MATH [23] dataset ... It contains 7,500 and 5,000 samples for training and testing, respectively. |
| Dataset Splits | No | GSM8K [13] is a dataset consisting of high-quality grade school math problems, containing 7,473 training samples and 1,319 testing samples; and (ii) MATH [23] dataset ... It contains 7,500 and 5,000 samples for training and testing, respectively. The paper specifies training and testing samples but does not explicitly mention a validation split. |
| Hardware Specification | Yes | We use 8 NVIDIA A100 GPUs to train the 7B and 13B models |
| Software Dependencies | No | The paper mentions software components such as GPT-3.5-Turbo and the AdamW optimizer but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For the fully fine-tuning setting, we use the AdamW optimizer to train the model with 3 epochs and the batch size is 128. We use 8 NVIDIA A100 GPUs to train the 7B and 13B models, the learning rate is set as 2e-5 with a 3% learning rate warmup. For the 70B model QLoRA fine-tuning, the LoRA rank and alpha are 96 and 16, with a 0.05 dropout between the two matrices. |
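
The reported hyperparameters map naturally onto a Hugging Face `transformers`/`peft` configuration. The sketch below is an illustration only, not the authors' released training code: the scheduler type, precision, per-device batch split, LoRA target modules, and the `output_dir` name are assumptions, while the epochs, effective batch size, learning rate, warmup ratio, LoRA rank/alpha, and dropout come from the quoted setup.

```python
# Illustrative sketch of the reported fine-tuning setup (not the authors' code).
from transformers import TrainingArguments
from peft import LoraConfig

# Full fine-tuning of the 7B/13B models on 8x A100:
# AdamW, 3 epochs, effective batch size 128, lr 2e-5, 3% warmup.
training_args = TrainingArguments(
    output_dir="metamath-7b",          # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=4,     # assumption: 4 * 4 accumulation * 8 GPUs = 128
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    warmup_ratio=0.03,
    optim="adamw_torch",
    lr_scheduler_type="cosine",        # assumption: scheduler not stated in the paper
    bf16=True,                         # assumption: precision not stated in the paper
)

# QLoRA setting reported for the 70B model: rank 96, alpha 16, dropout 0.05.
qlora_config = LoraConfig(
    r=96,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```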