MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MWPBENCH, MathScale-7B achieves state-of-the-art performance across all datasets. |
| Researcher Affiliation | Collaboration | (1) The Chinese University of Hong Kong, Shenzhen, China; (2) Microsoft Research Asia, Beijing, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China. |
| Pseudocode | Yes | `def is_bad_question(question): question = question.lower() keywords = ["which of the following", "which one", "which is", "the following", "which statement"] for keyword in keywords: if keyword in question: print(f"Filtered question: {question}") return True return False` (Listing 1, Filtering questions; a runnable reconstruction appears after this table) |
| Open Source Code | Yes | MWPBENCH is available at https://github.com/microsoft/unilm/tree/master/mathscale |
| Open Datasets | Yes | Existing Datasets Our first endeavor is to collate established datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), TAL-SCQ (TAL, 2023), Math23k (Wang et al., 2017), Ape210k (Zhao et al., 2020), GaokaoBench-Math (Zhang et al., 2023), and the AGIEval (Zhong et al., 2023) series (see Table 1). |
| Dataset Splits | Yes | In total, this dataset contains 1281 examples for training and 2818 examples for test. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like 'LLaMA-2 7B and 13B' and 'Mistral 7B', and uses 'GPT-3.5-Turbo-0613' and 'GPT-4' for data generation and validation. However, it does not specify software dependencies like Python, PyTorch, or TensorFlow with their version numbers. |
| Experiment Setup | Yes | We use a batch size of 128 and train on the MathScaleQA dataset for 3 epochs using a learning rate of 2e-5. We call the resulting models MathScale-7B, MathScale-13B and MathScale-Mistral-7B. (A hedged fine-tuning sketch based on these settings follows the table.) |
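
For reference, Listing 1 from the paper reconstructs to the following runnable Python; only the type hint, docstring, and comment are our additions.

```python
def is_bad_question(question: str) -> bool:
    """Return True if a question looks multiple-choice (Listing 1, 'Filtering questions')."""
    question = question.lower()
    # Keyword cues for multiple-choice phrasing, taken verbatim from the listing.
    keywords = [
        "which of the following",
        "which one",
        "which is",
        "the following",
        "which statement",
    ]
    for keyword in keywords:
        if keyword in question:
            print(f"Filtered question: {question}")
            return True
    return False
```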
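
The reported hyperparameters map onto a standard supervised fine-tuning run. Below is a minimal sketch, assuming Hugging Face Transformers (the paper does not name its training framework); only the effective batch size of 128, the 3 epochs, and the learning rate of 2e-5 come from the paper, while the checkpoint id, placeholder dataset, sequence length, and precision flag are illustrative assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical base checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder standing in for the MathScaleQA instruction data: the real examples
# are (question, step-by-step answer) pairs rendered as plain text.
raw = Dataset.from_dict({"text": ["Question: 2 + 2 = ?\nAnswer: 2 + 2 = 4."]})
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="mathscale-7b",      # hypothetical output path
    per_device_train_batch_size=8,  # 8 x 16 accumulation steps = effective batch of 128 (paper)
    gradient_accumulation_steps=16,
    num_train_epochs=3,             # from the paper
    learning_rate=2e-5,             # from the paper
    bf16=True,                      # precision choice is an assumption
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The quoted setup is shared across the MathScale-7B, MathScale-13B, and MathScale-Mistral-7B variants, so only the base checkpoint would change in this sketch.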