MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

Authors: Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, Hongsheng Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the MathCoder on five datasets, including two in-domain datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021); and three out-of-domain datasets: SVAMP (Patel et al., 2021), Mathematics (Saxton et al., 2019), and SimulEq (Kushman et al., 2014).
Researcher Affiliation | Collaboration | 1 Multimedia Laboratory (MMLab), The Chinese University of Hong Kong; 2 Shanghai AI Laboratory; 3 City University of Hong Kong; 4 Nanjing University
Pseudocode | No | The paper does not include dedicated pseudocode blocks or algorithms for its proposed methodology. Code snippets are presented as examples of the model's output rather than formal algorithmic descriptions.
Open Source Code | No | The proposed dataset and models will be released upon acceptance.
Open Datasets | Yes | We evaluate the MathCoder on five datasets, including two in-domain datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021); and three out-of-domain datasets: SVAMP (Patel et al., 2021), Mathematics (Saxton et al., 2019), and SimulEq (Kushman et al., 2014). (A hedged loading sketch for the publicly hosted GSM8K benchmark appears after the table.)
Dataset Splits | No | The paper mentions using the GSM8K and MATH training sets for supervised fine-tuning and evaluating on these and other datasets, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for any of these datasets.
Hardware Specification | Yes | The 7B, 13B, and 34B/70B models are trained on 8, 16, and 32 NVIDIA A800 80GB GPUs, respectively.
Software Dependencies | No | The paper mentions software such as DeepSpeed with the ZeRO-3 stage, flash attention, and the Hugging Face text-generation-inference framework, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | During training, we use a uniform learning rate of 2 × 10⁻⁵ and a context length of 2048, and we set the batch size as 128 with different ratios of gradient accumulation steps and per-device train batch size, considering the model size. Additionally, we used a cosine scheduler for three epochs in total with a 50-step warmup period. (A hedged configuration sketch combining these hyperparameters with the reported DeepSpeed ZeRO-3 setting appears after the table.)
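
Of the five evaluation sets, GSM8K is straightforwardly available on the Hugging Face Hub under the `gsm8k` ID. The snippet below is a minimal loading sketch under that assumption; the paper does not give Hub IDs for MATH, SVAMP, Mathematics, or SimulEq, so those are omitted here.

```python
from datasets import load_dataset

# GSM8K ("main" config) ships with "train" and "test" splits
# and "question"/"answer" fields.
gsm8k = load_dataset("gsm8k", "main")

print(gsm8k)                  # DatasetDict({'train': ..., 'test': ...})
sample = gsm8k["test"][0]
print(sample["question"])     # natural-language word problem
print(sample["answer"])       # step-by-step solution ending in "#### <number>"
```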
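
The reported setup (DeepSpeed with ZeRO-3, flash attention, learning rate 2 × 10⁻⁵, 2048-token context, effective batch size 128, cosine schedule, three epochs, 50 warmup steps) maps naturally onto Hugging Face `TrainingArguments`. The following is an assumption-laden sketch, not the authors' released configuration: the per-device/accumulation split, the bf16 choice, and the output path are illustrative only, since the paper fixes just the effective batch size of 128.

```python
from transformers import TrainingArguments

# Illustrative DeepSpeed ZeRO-3 config; "auto" lets DeepSpeed inherit the
# batch-size settings from TrainingArguments.
ds_zero3 = {
    "zero_optimization": {"stage": 3},       # ZeRO-3 sharding, as mentioned in the paper
    "bf16": {"enabled": True},               # assumption: precision mode is not stated
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="mathcoder-sft",              # hypothetical output path
    learning_rate=2e-5,                      # uniform learning rate from the paper
    num_train_epochs=3,                      # three epochs in total
    lr_scheduler_type="cosine",              # cosine scheduler
    warmup_steps=50,                         # 50-step warmup period
    per_device_train_batch_size=4,           # assumption: 8 GPUs x 4 x 4 accumulation = 128
    gradient_accumulation_steps=4,
    bf16=True,                               # assumption
    deepspeed=ds_zero3,                      # DeepSpeed ZeRO-3 config (dict or JSON path)
    logging_steps=10,
    save_strategy="epoch",
)

# The 2048-token context length would be enforced at tokenization time, e.g.
# tokenizer(batch["text"], truncation=True, max_length=2048), and flash attention
# would typically be requested when loading the model (exact mechanism depends on
# the transformers version in use).
```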