MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning
Authors: Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, Hongsheng Li
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the MathCoder on five datasets, including two in-domain datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021); and three out-of-domain datasets: SVAMP (Patel et al., 2021), Mathematics (Saxton et al., 2019), and SimulEq (Kushman et al., 2014). |
| Researcher Affiliation | Collaboration | ¹Multimedia Laboratory (MMLab), The Chinese University of Hong Kong; ²Shanghai AI Laboratory; ³City University of Hong Kong; ⁴Nanjing University |
| Pseudocode | No | The paper does not include dedicated pseudocode blocks or algorithms for its proposed methodology. Code snippets are presented as examples of the model's output rather than formal algorithmic descriptions. |
| Open Source Code | No | The proposed dataset and models will be released upon acceptance. |
| Open Datasets | Yes | We evaluate the MathCoder on five datasets, including two in-domain datasets: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021); and three out-of-domain datasets: SVAMP (Patel et al., 2021), Mathematics (Saxton et al., 2019), and SimulEq (Kushman et al., 2014). |
| Dataset Splits | No | The paper mentions using GSM8K and MATH training sets for supervised fine-tuning and evaluating on these and other datasets, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for any of these datasets. |
| Hardware Specification | Yes | The 7B, 13B, and 34B/70B models are trained on 8, 16, and 32 NVIDIA A800 80GB GPUs, respectively. |
| Software Dependencies | No | The paper mentions software like DeepSpeed with the ZeRO-3 stage, FlashAttention, and the Hugging Face text-generation-inference framework, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | During training, we use a uniform learning rate of 2 × 10⁻⁵ and a context length of 2048, and we set the batch size as 128 with different ratios of gradient accumulation steps and per-device train batch size, considering the model size. Additionally, we used a cosine scheduler for three epochs in total with a 50-step warmup period. |
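
To make the reported setup concrete, below is a minimal, hypothetical configuration sketch (not the authors' unreleased training script) that mirrors the stated hyperparameters using the Hugging Face `transformers` `TrainingArguments` API. The output directory, DeepSpeed config filename, precision flag, and the per-device batch / gradient-accumulation split are illustrative assumptions; only the learning rate, scheduler, warmup, epochs, effective batch size, and context length come from the paper.

```python
# Hypothetical fine-tuning configuration mirroring the paper's reported setup:
# learning rate 2e-5, effective batch size 128, cosine schedule with a 50-step
# warmup, 3 epochs, context length 2048, DeepSpeed ZeRO-3. All names and paths
# below are illustrative assumptions, not taken from the authors' code.
from transformers import TrainingArguments

MAX_SEQ_LEN = 2048  # context length used when tokenizing the SFT data

args = TrainingArguments(
    output_dir="mathcoder-sft",       # placeholder output directory
    learning_rate=2e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    # Effective batch size 128 = per_device * grad_accum * n_gpus.
    # The split below assumes the 7B setting with 8 GPUs (8 * 4 * 4 = 128);
    # the paper states the ratio varies with model size.
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=True,                        # assumed mixed-precision setting
    deepspeed="ds_zero3.json",        # ZeRO-3 stage config; file must exist
    logging_steps=10,
    save_strategy="epoch",
)

# A Trainer would then be built with a CodeLlama/Llama-2 base model and the
# supervised fine-tuning data tokenized to MAX_SEQ_LEN, and trainer.train()
# launched under DeepSpeed on 8/16/32 A800 GPUs depending on model size.
```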