MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Authors: Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate MAmmoTH on a spectrum of datasets, including in-domain (IND) test sets GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), AQuA-RAT (Ling et al., 2017), NumGLUE (Mishra et al., 2022b) and out-of-domain (OOD) test sets SVAMP (Patel et al., 2021), SAT (Zhong et al., 2023), MMLU-Math (Hendrycks et al., 2021a), Mathematics (Davies et al., 2021), and SimulEq (Koncel-Kedziorski et al., 2016). Compared with existing methods, our models generalize better to OOD datasets and substantially improve the performance of open-source LLMs in mathematical reasoning." (A minimal answer-scoring sketch follows the table.)
Researcher Affiliation | Collaboration | University of Waterloo, The Ohio State University, HKUST, University of Edinburgh, 01.AI
Pseudocode | No | The paper presents code snippets as part of case studies (Figures 3, 4, and 5) but does not include structured pseudocode or algorithm blocks describing its methodology.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for its methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | "MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math." (Table 1: GSM8K (Cobbe et al., 2021), AQuA-RAT (Ling et al., 2017), MATH (Hendrycks et al., 2021b), TheoremQA (Chen et al., 2023), Camel-Math (Li et al., 2023a), College-Math, MathQA (Amini et al., 2019), NumGLUE (Mishra et al., 2022a)) (A data-format sketch follows the table.)
Dataset Splits | No | The paper mentions training on MathInstruct and evaluating on separate test sets, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the training process itself.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU models, or cloud instance types.
Software Dependencies | No | "We fine-tune all the models with Huggingface transformers library (Wolf et al., 2019)." While the library is named, no version of transformers is specified, nor are other software dependencies pinned to versions.
Experiment Setup | Yes | "We use a learning rate of 2e-5 for the 7B and 13B models, and 1e-5 for the 34B and 70B models. We set the batch size at 128 and used a cosine scheduler with a 3% warm-up period for three epochs." (A fine-tuning configuration sketch follows the table.)
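
The hybrid of CoT and PoT rationales noted in the Open Datasets row can be illustrated with a small sketch. The field names and the instruction-following prompt template below are assumptions for illustration, not the exact MathInstruct schema released with the paper.

    # Illustrative sketch of hybrid CoT/PoT instruction-tuning examples.
    # Field names and the prompt template are assumptions, not the exact
    # MathInstruct schema.

    cot_example = {
        "instruction": (
            "Natalia sold clips to 48 of her friends in April, and then half as many "
            "clips in May. How many clips did she sell altogether?"
        ),
        "output": (
            "In May she sold 48 / 2 = 24 clips. "
            "In total she sold 48 + 24 = 72 clips. The answer is 72."
        ),
    }

    pot_example = {
        "instruction": (
            "Natalia sold clips to 48 of her friends in April, and then half as many "
            "clips in May. How many clips did she sell altogether? "
            "Write a Python program to solve it."
        ),
        "output": "april = 48\nmay = april // 2\nprint(april + may)",
    }

    def to_training_text(example: dict) -> str:
        """Render one example as a single instruction-following training string."""
        return (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )

    print(to_training_text(cot_example))

Mixing both rationale styles in one corpus is what the paper refers to as hybrid instruction tuning: the CoT rows teach free-form reasoning, while the PoT rows teach the model to offload computation to executable code.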
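For the Software Dependencies and Experiment Setup rows, here is a minimal sketch of how such a run might be configured with the Hugging Face transformers Trainer, using the reported hyperparameters for the 7B model (learning rate 2e-5, effective batch size 128, cosine schedule, 3% warm-up, three epochs). The base model name, the toy dataset, and the per-device/accumulation split are assumptions; the authors' actual training script is not released.

    # Hedged sketch of a fine-tuning configuration matching the reported
    # hyperparameters for the 7B model. Base model, toy dataset, and the
    # per-device / accumulation split are illustrative assumptions.
    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model for the 7B variant
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama has no pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Tiny stand-in corpus; in practice this would be the full MathInstruct data.
    raw = [{"text": "### Instruction:\nWhat is 15 * 4?\n\n### Response:\nprint(15 * 4)"}]

    def tokenize(example):
        return tokenizer(example["text"], truncation=True, max_length=512)

    train_dataset = Dataset.from_list(raw).map(tokenize, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="mammoth-7b-sketch",
        learning_rate=2e-5,               # 1e-5 for the 34B/70B models
        num_train_epochs=3,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,                # 3% warm-up period
        per_device_train_batch_size=8,    # 8 x 16 accumulation = effective batch
        gradient_accumulation_steps=16,   # size of 128 on a single device
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
        tokenizer=tokenizer,
    )
    trainer.train()

Since the paper does not report hardware, the per-device batch size and accumulation steps above are only one way to reach the stated global batch size of 128.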
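For the evaluation described in the Research Type row, the following is a small sketch of exact-match scoring on GSM8K-style numeric answers. The answer-extraction heuristic is an assumption for illustration, not the paper's evaluation harness.

    import re

    def extract_final_number(text: str):
        """Pull the last number from a model response, e.g. '... The answer is 72.'."""
        matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return float(matches[-1]) if matches else None

    def exact_match_accuracy(predictions, references):
        """Fraction of responses whose final number equals the gold answer."""
        correct = sum(
            1 for pred, gold in zip(predictions, references)
            if extract_final_number(pred) == float(gold)
        )
        return correct / len(references)

    # Example with one GSM8K-style item.
    preds = ["In May she sold 24 clips, so 48 + 24 = 72. The answer is 72."]
    golds = ["72"]
    print(exact_match_accuracy(preds, golds))  # 1.0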