MathScale: Scaling Instruction Tuning for Mathematical Reasoning

Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MWPBENCH, MathScale-7B achieves state-of-the-art performance across all datasets.
Researcher Affiliation | Collaboration | 1. The Chinese University of Hong Kong, Shenzhen, China; 2. Microsoft Research Asia, Beijing, China; 3. Shenzhen Research Institute of Big Data, Shenzhen, China.
Pseudocode | Yes | Listing 1. Filtering questions:

    def is_bad_question(question):
        # Lowercase so keyword matching is case-insensitive.
        question = question.lower()
        # Phrasings that indicate multiple-choice questions, which are filtered out.
        keywords = [
            "which of the following",
            "which one",
            "which is",
            "the following",
            "which statement",
        ]
        for keyword in keywords:
            if keyword in question:
                print(f"Filtered question: {question}")
                return True
        return False
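As a quick illustration of how such a filter could be applied in a preprocessing pass, the sketch below runs is_bad_question over a small candidate pool and keeps only the free-form questions. The example questions and the keep/discard step are illustrative assumptions, not taken from the MathScaleQA pipeline.

    # Hypothetical usage: drop multiple-choice-style questions from a candidate pool.
    candidates = [
        "Which of the following numbers is prime: 4, 6, 7, or 9?",
        "A train travels 120 km in 2 hours. What is its average speed?",
    ]
    kept = [q for q in candidates if not is_bad_question(q)]
    # kept now contains only the second, free-form question.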
Open Source Code | Yes | MWPBENCH is available at https://github.com/microsoft/unilm/tree/master/mathscale
Open Datasets | Yes | Existing Datasets: Our first endeavor is to collate established datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), TAL-SCQ (TAL, 2023), Math23k (Wang et al., 2017), Ape210k (Zhao et al., 2020), GaokaoBench-Math (Zhang et al., 2023), and the AGIEval (Zhong et al., 2023) series (see Table 1).
Dataset Splits | Yes | In total, this dataset contains 1281 examples for training and 2818 examples for testing.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions models like 'LLaMA-2 7B and 13B' and 'Mistral 7B', and uses 'GPT-3.5-Turbo-0613' and 'GPT-4' for data generation and validation. However, it does not specify software dependencies like Python, PyTorch, or TensorFlow with their version numbers.
Experiment Setup | Yes | We use a batch size of 128 and train on the MathScaleQA dataset for 3 epochs using a learning rate of 2e-5. We call the resulting models MathScale-7B, MathScale-13B and MathScale-Mistral-7B.
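The reported setup (batch size 128, 3 epochs, learning rate 2e-5) maps naturally onto a standard supervised fine-tuning configuration. The minimal sketch below expresses these hyperparameters with Hugging Face TrainingArguments; the output path, per-device batch-size split, and gradient-accumulation choice are assumptions for illustration and are not stated in the paper.

    # Minimal sketch of the reported fine-tuning hyperparameters, assuming a
    # Hugging Face Trainer-style setup (paths and batch-size split are placeholders).
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="mathscale-7b",          # hypothetical output path
        num_train_epochs=3,                 # 3 epochs, as reported
        learning_rate=2e-5,                 # learning rate, as reported
        per_device_train_batch_size=16,     # assumption: 16 per device x 8 devices = global batch 128
        gradient_accumulation_steps=1,      # assumption; adjust to reach a global batch of 128
        # bf16=True,                        # precision is not stated in the paper; enable if supported
    )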