MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning
Authors: Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate MAmmoTH on a spectrum of datasets, including in-domain (IND) test sets GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), AQuA-RAT (Ling et al., 2017), NumGLUE (Mishra et al., 2022b) and out-of-domain (OOD) test sets SVAMP (Patel et al., 2021), SAT (Zhong et al., 2023), MMLU-Math (Hendrycks et al., 2021a), Mathematics (Davies et al., 2021), and SimulEq (Koncel-Kedziorski et al., 2016). Compared with existing methods, our models generalize better to OOD datasets and substantially improve the performance of open-source LLMs in mathematical reasoning. |
| Researcher Affiliation | Collaboration | University of Waterloo, The Ohio State University, HKUST, University of Edinburgh, 01.AI |
| Pseudocode | No | The paper presents code snippets as part of case studies (Figures 3, 4, and 5) but does not include structured pseudocode or algorithm blocks describing its methodology. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for their methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. (Table 1: GSM8K (Cobbe et al., 2021), AQuA-RAT (Ling et al., 2017), MATH (Hendrycks et al., 2021b), TheoremQA (Chen et al., 2023), Camel-Math (Li et al., 2023a), College-Math, MathQA (Amini et al., 2019), NumGLUE (Mishra et al., 2022a)) An illustrative hybrid CoT/PoT instance is sketched after this table. |
| Dataset Splits | No | The paper mentions training on MathInstruct and evaluating on separate test sets, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the training process itself. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU models, or cloud instance types. |
| Software Dependencies | No | We fine-tune all the models with Huggingface transformers library (Wolf et al., 2019). While a library is named, no version number is given for the Huggingface transformers library, nor are other software dependencies listed with specific versions. |
| Experiment Setup | Yes | We use a learning rate of 2e-5 for the 7B and 13B models, and 1e-5 for the 34B and 70B models. We set the batch size at 128 and used a cosine scheduler with a 3% warm-up period for three epochs. |
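
The "Open Datasets" row above describes MathInstruct as a hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales. The following is a minimal, purely illustrative sketch of what one instance of each style could look like; the field names and the example problem are assumptions for illustration, not records from the released dataset.

```python
# Purely illustrative sketch of hybrid MathInstruct-style training instances.
# The field names ("instruction", "output") and the example problem are
# assumptions, not copied from the released dataset.

# Chain-of-thought (CoT): the rationale is natural-language step-by-step reasoning.
cot_instance = {
    "instruction": "Natalia sold clips to 48 of her friends in April, and then she "
                   "sold half as many clips in May. How many clips did Natalia sell "
                   "altogether in April and May?",
    "output": "In April, Natalia sold 48 clips. In May, she sold 48 / 2 = 24 clips. "
              "Altogether she sold 48 + 24 = 72 clips. The answer is 72.",
}

# Program-of-thought (PoT): the rationale is a short program whose printed result
# is the answer, offloading the arithmetic to an external interpreter.
pot_instance = {
    "instruction": "Natalia sold clips to 48 of her friends in April, and then she "
                   "sold half as many clips in May. How many clips did Natalia sell "
                   "altogether in April and May? Write a program to solve it.",
    "output": "april = 48\nmay = april / 2\nprint(april + may)",
}
```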
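
The "Software Dependencies" and "Experiment Setup" rows together imply a fairly standard HuggingFace Transformers fine-tuning configuration. The sketch below wires the reported hyperparameters (learning rate 2e-5 for the 7B/13B models, effective batch size 128, cosine scheduler with a 3% warm-up, three epochs) into `TrainingArguments`; the base checkpoint, data path, maximum sequence length, and per-device batch size / gradient accumulation split are assumptions, since the paper does not specify hardware or release code.

```python
# Hedged sketch of a fine-tuning run matching the reported hyperparameters
# (HuggingFace Transformers Trainer; lr 2e-5 for 7B/13B, effective batch 128,
# cosine schedule with 3% warm-up, 3 epochs). The checkpoint name, data path,
# max sequence length, and batch-size/accumulation split are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed 7B base model
DATA_PATH = "math_instruct.json"          # placeholder path to the instruction data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def tokenize(example):
    # Concatenate the instruction and rationale into one causal-LM training sequence.
    text = example["instruction"] + "\n" + example["output"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=2048)

train_data = load_dataset("json", data_files=DATA_PATH, split="train").map(tokenize)

args = TrainingArguments(
    output_dir="mammoth-7b-ft",
    learning_rate=2e-5,                  # paper: 1e-5 for the 34B/70B models
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                   # 3% warm-up period
    per_device_train_batch_size=8,       # 8 x 16 accumulation = effective batch 128
    gradient_accumulation_steps=16,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The per-device batch size and gradient accumulation split above is only one way to reach the reported effective batch size of 128; since the paper does not describe its hardware, the actual split is unknown.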