JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models
Authors: Kun Zhou, Beichen Zhang, Jiapeng Wang, Zhipeng Chen, Xin Zhao, Jing Sha, Zhichao Sheng, Shijin Wang, Ji-Rong Wen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments |
| Researcher Affiliation | Collaboration | 1 School of Information, Renmin University of China; 2 Gaoling School of Artificial Intelligence, Renmin University of China; 3 iFLYTEK Research; 4 iFLYTEK AI Research (Central China) |
| Pseudocode | No | The paper describes methods in prose and mathematical formulations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data will be publicly released at https://github.com/RUCAIBox/JiuZhang3.0. |
| Open Datasets | Yes | Webpages: we use the OpenWebMath corpus [58], which consists of 6.3M math-related web documents extracted from Common Crawl. Books: we use the MathPile-textbook dataset [59], including 4K educational textbooks, lecture notes and synthetic books. Papers: we use the MathPile-arXiv dataset [59], and select the high-quality documents according to the estimated scores (0.6-0.9) released by AutoMathText [60]. QA Data: we select the Stack Exchange subset of the MMIQC dataset [41], which contains 1.2M processed real-world math question-answering pairs. Wikipedia: we use the MathPile-Wikipedia dataset [59], consisting of 106K documents from math-related entries in Wikipedia. |
| Dataset Splits | No | The paper describes training and test sets but does not explicitly provide details on a separate validation set split or its size/composition. |
| Hardware Specification | Yes | For a fair comparison, we assume that GPT-4 is utilized to synthesize training data, and 8 nodes of 8 A100 GPU servers (64 GPUs in total) are leveraged for LLMs training. |
| Software Dependencies | Yes | We train all models with the BFloat16 numerical format, FlashAttention 2.0, and DeepSpeed ZeRO Stage 2 for the 7B and 8B models and Stage 3 for the 8×7B model. (A hedged configuration sketch follows the table.) |
| Experiment Setup | Yes | During training, following existing work [14], we adopt a cosine learning rate schedule with a 0% warm-up ratio and select a learning rate of 1e-5, training for 5 epochs for natural language reasoning and 10 epochs for tool manipulation. We reuse the optimizer state to initialize the fine-tuning stage and adopt the Warmup-Stable-Decay learning rate scheduler [85] with a 3% warm-up ratio and an 85% stable-training ratio for 1 epoch over the whole training process. We set the maximum learning rate to 1e-5 and the minimum learning rate to 1e-6 with a total batch size of 512. To boost training efficiency, we pack multiple instances into the same context window and modify the attention to avoid mutual interference among different instances. The maximum sequence length of the model is set to 2048. (Hedged sketches of the scheduler and the packing mask follow the table.) |
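The Software Dependencies row names BFloat16, FlashAttention 2.0, and DeepSpeed Stage 2/3, but the paper's exact configuration is not quoted. The snippet below is a minimal, hypothetical DeepSpeed configuration (written as a Python dict) consistent with that description: only the precision, ZeRO stage, and total batch size come from the table; the remaining values are illustrative assumptions.

```python
# Hypothetical DeepSpeed configuration matching the reported stack.
# BF16 precision and ZeRO Stage 2 as in the table (Stage 3 would be used
# for the 8x7B model); batch size 512 is the fine-tuning batch size reported
# in the Experiment Setup row. Other values are assumptions.
import json

ds_config = {
    "bf16": {"enabled": True},          # BFloat16 numerical format
    "zero_optimization": {"stage": 2},  # set to 3 for the 8x7B model
    "train_batch_size": 512,            # total batch size from the table
    "gradient_clipping": 1.0,           # assumption, not stated in the paper
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```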
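The Experiment Setup row describes a Warmup-Stable-Decay schedule with a 3% warm-up ratio, an 85% stable ratio, a maximum learning rate of 1e-5, and a minimum of 1e-6. The sketch below is one plausible reading of that schedule, assuming linear warm-up and linear decay; the paper cites [85] for the exact scheduler, so the phase shapes here are assumptions.

```python
def wsd_lr(step: int, total_steps: int,
           max_lr: float = 1e-5, min_lr: float = 1e-6,
           warmup_ratio: float = 0.03, stable_ratio: float = 0.85) -> float:
    """Warmup-Stable-Decay learning rate with the ratios reported in the paper.

    Linear warm-up and linear decay are assumptions; see [85] for the
    scheduler the authors actually cite.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    stable_steps = int(stable_ratio * total_steps)
    decay_steps = max(total_steps - warmup_steps - stable_steps, 1)

    if step < warmup_steps:                 # warm-up: 0 -> max_lr
        return max_lr * (step + 1) / max(warmup_steps, 1)
    if step < warmup_steps + stable_steps:  # stable: hold max_lr
        return max_lr
    # decay: max_lr -> min_lr over the remaining steps
    progress = (step - warmup_steps - stable_steps) / decay_steps
    return max_lr - (max_lr - min_lr) * min(progress, 1.0)
```

For example, with `total_steps=10000`, steps 0-299 warm up to 1e-5, steps 300-8799 hold 1e-5, and the remaining steps decay toward 1e-6.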
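The same row mentions packing multiple instances into one context window while modifying attention so packed instances do not interfere with each other. The sketch below illustrates that idea with an explicit block-diagonal causal mask built from per-instance lengths; in practice this is often realized through variable-length attention kernels or position-id resets, so treat this as an illustration of the masking idea rather than the paper's implementation.

```python
import torch

def packed_causal_mask(lengths: list[int]) -> torch.Tensor:
    """Boolean attention mask for a sequence built by packing several instances.

    True means "may attend". Each packed instance attends only (causally) to
    tokens within its own segment, so there is no cross-instance interference.
    """
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in lengths:
        end = start + length
        # causal (lower-triangular) mask restricted to this instance's block
        mask[start:end, start:end] = torch.tril(torch.ones(length, length)).bool()
        start = end
    return mask

# e.g. three instances of lengths 5, 3 and 4 packed into a 12-token window
mask = packed_causal_mask([5, 3, 4])
assert mask.shape == (12, 12) and not mask[5, 0]  # instance 2 cannot see instance 1
```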