MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Authors: Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate MAmmoTH on a spectrum of datasets, including in-domain (IND) test sets GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), AQuA-RAT (Ling et al., 2017), NumGLUE (Mishra et al., 2022b) and out-of-domain (OOD) test sets SVAMP (Patel et al., 2021), SAT (Zhong et al., 2023), MMLU-Math (Hendrycks et al., 2021a), Mathematics (Davies et al., 2021), and SimulEq (Koncel-Kedziorski et al., 2016). Compared with existing methods, our models generalize better to OOD datasets and substantially improve the performance of open-source LLMs in mathematical reasoning." (A minimal answer-scoring sketch follows the table.)
Researcher Affiliation | Collaboration | University of Waterloo, The Ohio State University, HKUST, University of Edinburgh, 01.AI
Pseudocode | No | The paper presents code snippets as part of case studies (Figures 3, 4, and 5) but does not include structured pseudocode or algorithm blocks describing its methodology.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for its methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | "MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math." (Table 1: GSM8K (Cobbe et al., 2021), AQuA-RAT (Ling et al., 2017), MATH (Hendrycks et al., 2021b), TheoremQA (Chen et al., 2023), Camel-Math (Li et al., 2023a), College-Math, MathQA (Amini et al., 2019), NumGLUE (Mishra et al., 2022a)) (A data-format sketch follows the table.)
Dataset Splits | No | The paper mentions training on MathInstruct and evaluating on separate test sets, but it does not specify explicit train/validation/test splits (e.g., percentages or counts) for the training process itself.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU models, or cloud instance types.
Software Dependencies | No | "We fine-tune all the models with Huggingface transformers library (Wolf et al., 2019)." While the library is named, no version of transformers is specified, nor are other software dependencies pinned to versions.
Experiment Setup | Yes | "We use a learning rate of 2e-5 for the 7B and 13B models, and 1e-5 for the 34B and 70B models. We set the batch size at 128 and used a cosine scheduler with a 3% warm-up period for three epochs." (A fine-tuning configuration sketch follows the table.)
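
The hybrid of CoT and PoT rationales noted in the Open Datasets row can be illustrated with a small sketch. The field names and the instruction-following prompt template below are assumptions for illustration, not the exact MathInstruct schema released with the paper.

    # Illustrative sketch of hybrid CoT/PoT instruction-tuning examples.
    # Field names and the prompt template are assumptions, not the exact
    # MathInstruct schema.

    cot_example = {
        "instruction": (
            "Natalia sold clips to 48 of her friends in April, and then half as many "
            "clips in May. How many clips did she sell altogether?"
        ),
        "output": (
            "In May she sold 48 / 2 = 24 clips. "
            "In total she sold 48 + 24 = 72 clips. The answer is 72."
        ),
    }

    pot_example = {
        "instruction": (
            "Natalia sold clips to 48 of her friends in April, and then half as many "
            "clips in May. How many clips did she sell altogether? "
            "Write a Python program to solve it."
        ),
        "output": "april = 48\nmay = april // 2\nprint(april + may)",
    }

    def to_training_text(example: dict) -> str:
        """Render one example as a single instruction-following training string."""
        return (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['output']}"
        )

    print(to_training_text(cot_example))

Mixing both rationale styles in one corpus is what the paper refers to as hybrid instruction tuning: the CoT rows teach free-form reasoning, while the PoT rows teach the model to offload computation to executable code.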
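For the Software Dependencies and Experiment Setup rows, here is a minimal sketch of how such a run might be configured with the Hugging Face transformers Trainer, using the reported hyperparameters for the 7B model (learning rate 2e-5, effective batch size 128, cosine schedule, 3% warm-up, three epochs). The base model name, the toy dataset, and the per-device/accumulation split are assumptions; the authors' actual training script is not released.

    # Hedged sketch of a fine-tuning configuration matching the reported
    # hyperparameters for the 7B model. Base model, toy dataset, and the
    # per-device / accumulation split are illustrative assumptions.
    from datasets import Dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model for the 7B variant
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # Llama has no pad token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Tiny stand-in corpus; in practice this would be the full MathInstruct data.
    raw = [{"text": "### Instruction:\nWhat is 15 * 4?\n\n### Response:\nprint(15 * 4)"}]

    def tokenize(example):
        return tokenizer(example["text"], truncation=True, max_length=512)

    train_dataset = Dataset.from_list(raw).map(tokenize, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="mammoth-7b-sketch",
        learning_rate=2e-5,               # 1e-5 for the 34B/70B models
        num_train_epochs=3,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,                # 3% warm-up period
        per_device_train_batch_size=8,    # 8 x 16 accumulation = effective batch
        gradient_accumulation_steps=16,   # size of 128 on a single device
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
        tokenizer=tokenizer,
    )
    trainer.train()

Since the paper does not report hardware, the per-device batch size and accumulation steps above are only one way to reach the stated global batch size of 128.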
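For the evaluation described in the Research Type row, the following is a small sketch of exact-match scoring on GSM8K-style numeric answers. The answer-extraction heuristic is an assumption for illustration, not the paper's evaluation harness.

    import re

    def extract_final_number(text: str):
        """Pull the last number from a model response, e.g. '... The answer is 72.'."""
        matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return float(matches[-1]) if matches else None

    def exact_match_accuracy(predictions, references):
        """Fraction of responses whose final number equals the gold answer."""
        correct = sum(
            1 for pred, gold in zip(predictions, references)
            if extract_final_number(pred) == float(gold)
        )
        return correct / len(references)

    # Example with one GSM8K-style item.
    preds = ["In May she sold 24 clips, so 48 + 24 = 72. The answer is 72."]
    golds = ["72"]
    print(exact_match_accuracy(preds, golds))  # 1.0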