reproducibilityindex.ai

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MWPBENCH, MathScale-7B achieves state-of-the-art performance across all datasets.
Researcher Affiliation	Collaboration	(1) The Chinese University of Hong Kong, Shenzhen, China; (2) Microsoft Research Asia, Beijing, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China.
Pseudocode	Yes	Listing 1. Filtering questions:
    def is_bad_question(question):
        # Match case-insensitively against phrases typical of multiple-choice questions.
        question = question.lower()
        keywords = [
            "which of the following",
            "which one",
            "which is",
            "the following",
            "which statement",
        ]
        for keyword in keywords:
            if keyword in question:
                print(f"Filtered question: {question}")
                return True
        return False
Open Source Code	Yes	MWPBENCH is available at https://github.com/microsoft/unilm/tree/master/mathscale
Open Datasets	Yes	Existing Datasets: Our first endeavor is to collate established datasets, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021b), TAL-SCQ (TAL, 2023), Math23k (Wang et al., 2017), Ape210k (Zhao et al., 2020), GaokaoBench-Math (Zhang et al., 2023), and AGIEval (Zhong et al., 2023) series (see Table 1).
Dataset Splits	Yes	In total, this dataset contains 1281 examples for training and 2818 examples for test.
Hardware Specification	No	The paper does not provide specific hardware details (such as GPU or CPU models, or memory specifications) used for running the experiments.
Software Dependencies	No	The paper mentions models like 'LLaMA-2 7B and 13B' and 'Mistral 7B', and uses 'GPT-3.5-Turbo-0613' and 'GPT-4' for data generation and validation. However, it does not specify software dependencies like Python, PyTorch, or TensorFlow with their version numbers.
Experiment Setup	Yes	We use a batch size of 128 and train on the MathScaleQA dataset for 3 epochs using a learning rate of 2e-5. We call the resulting models MathScale-7B, MathScale-13B and MathScale-Mistral-7B.
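
The reported setup (batch size 128, 3 epochs, learning rate 2e-5) can be written down as a small config to estimate the length of a fine-tuning run. This is a minimal sketch and not the authors' training code: `FineTuneConfig` and `total_optimizer_steps` are hypothetical names, the batch size is assumed to be the global (effective) batch size, the last partial batch of each epoch is assumed to be kept, and the example dataset size is illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are hypothetical)."""
    batch_size: int = 128        # assumed to be the global (effective) batch size
    num_epochs: int = 3
    learning_rate: float = 2e-5


def total_optimizer_steps(num_examples: int, cfg: FineTuneConfig) -> int:
    """Number of optimizer steps, assuming the last partial batch is kept."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.num_epochs


if __name__ == "__main__":
    cfg = FineTuneConfig()
    # 10_000 is an illustrative dataset size, not a figure from the paper.
    print(total_optimizer_steps(10_000, cfg))
```

Substituting the actual number of training examples gives the step count needed to configure a learning-rate schedule; the paper does not report hardware, so per-device batch size and gradient accumulation are left open here.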