Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MWPBENCH, MathScale-7B achieves state-of-the-art performance across all datasets. |
| Researcher Affiliation | Collaboration | (1) The Chinese University of Hong Kong, Shenzhen, China; (2) Microsoft Research Asia, Beijing, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China. |
| Pseudocode | Yes | `def is_bad_question(question): question = question.lower() keywords = ["which of the following", "which one", "which is", "the following", "which statement"] for keyword in keywords: if keyword in question: print(f"Filtered question: {question}") return True return False` (Listing 1. Filtering questions) |
| Open Source Code | Yes | MWPBENCH is available at https://github.com/microsoft/unilm/tree/master/mathscale |
| Open Datasets | Yes | Existing Datasets: Our first endeavor is to collate established datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), TAL-SCQ (TAL, 2023), Math23k (Wang et al., 2017), Ape210k (Zhao et al., 2020), GaokaoBench-Math (Zhang et al., 2023), and AGIEval (Zhong et al., 2023) series (see Table 1). |
| Dataset Splits | Yes | In total, this dataset contains 1281 examples for training and 2818 examples for test. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like 'LLaMA-2 7B and 13B' and 'Mistral 7B', and uses 'GPT-3.5-Turbo-0613' and 'GPT-4' for data generation and validation. However, it does not specify software dependencies like Python, PyTorch, or TensorFlow with their version numbers. |
| Experiment Setup | Yes | We use a batch size of 128 and train on the MathScaleQA dataset for 3 epochs using a learning rate of 2e-5. We call the resulting models MathScale-7B, MathScale-13B and MathScale-Mistral-7B. |
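
The filtering routine quoted in the Pseudocode row is flattened by the table layout, so a runnable reading of it may be easier to follow. The sketch below keeps the keyword list and the lowercase substring check from Listing 1. The `verbose` flag (replacing the unconditional `print`) and the `filter_questions` helper are assumptions added for illustration and do not appear in the paper.

```python
# Keyword list as quoted from Listing 1 of the paper.
BAD_QUESTION_KEYWORDS = (
    "which of the following",
    "which one",
    "which is",
    "the following",
    "which statement",
)


def is_bad_question(question: str, verbose: bool = False) -> bool:
    """Return True if the lowercased question contains any listed keyword.

    The `verbose` flag is an assumption; Listing 1 always prints.
    """
    lowered = question.lower()
    for keyword in BAD_QUESTION_KEYWORDS:
        if keyword in lowered:
            if verbose:
                # Listing 1 prints the lowercased question text.
                print(f"Filtered question: {lowered}")
            return True
    return False


def filter_questions(questions):
    """Hypothetical helper (not in the paper): keep questions that pass the filter."""
    return [q for q in questions if not is_bad_question(q)]
```

For instance, `filter_questions(["Which one is larger, 3 or 5?", "Compute 7 * 8."])` would keep only the second question, since the first contains the keyword "which one" after lowercasing.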