JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

Authors: Kun Zhou, Beichen Zhang, Jiapeng Wang, Zhipeng Chen, Xin Zhao, Jing Sha, Zhichao Sheng, Shijin Wang, Ji-Rong Wen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments
Researcher Affiliation | Collaboration | 1) School of Information, Renmin University of China; 2) Gaoling School of Artificial Intelligence, Renmin University of China; 3) iFLYTEK Research; 4) iFLYTEK AI Research (Central China)
Pseudocode | No | The paper describes methods in prose and mathematical formulations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data will be publicly released in https://github.com/RUCAIBox/JiuZhang3.0.
Open Datasets | Yes | Webpages: the OpenWebMath corpus [58], consisting of 6.3M math-related web documents extracted from Common Crawl. Books: the MathPile-textbook dataset [59], including 4K educational textbooks, lecture notes, and synthetic books. Papers: the MathPile-arXiv dataset [59], keeping only the high-quality documents with estimated scores of 0.6-0.9 released by AutoMathText [60]. QA Data: the Stack Exchange subset of the MMIQC dataset [41], which contains 1.2M processed real-world math question-answering pairs. Wikipedia: the MathPile-Wikipedia dataset [59], consisting of 106K documents from math-related entries in Wikipedia.
Dataset Splits | No | The paper describes training and test sets but does not explicitly provide details on a separate validation split or its size and composition.
Hardware Specification | Yes | For a fair comparison, we assume that GPT-4 is utilized to synthesize training data, and 8 nodes of 8 A100 GPU servers (64 GPUs in total) are leveraged for LLM training.
Software Dependencies | Yes | We train all models with the BFloat16 numerical format, FlashAttention 2.0, and DeepSpeed Stage 2 for the 7B and 8B models (Stage 3 for the 8x7B model); see the configuration sketch after this table.
Experiment Setup | Yes | During training, following existing work [14], we adopt a cosine learning rate schedule with a 0% warm-up ratio and a learning rate of 1e-5, training for 5 epochs for natural language reasoning and 10 epochs for tool manipulation. We reuse the optimizer to initialize the fine-tuning stage and adopt the Warmup-Stable-Decay learning rate scheduler [85] with a 3% warm-up ratio and an 85% stable training ratio for 1 epoch over the whole training process; a schedule sketch follows this table. We set the maximum learning rate to 1e-5 and the minimum learning rate to 1e-6 with a total batch size of 512. To boost training efficiency, we pack multiple instances into the same context window of the model and modify the attention to avoid mutual interference among different instances. The maximum context length of the model is set to 2048.
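
The software stack reported above (BFloat16, FlashAttention 2.0, DeepSpeed Stage 2 or 3) can be wired together as in the minimal sketch below, assuming Hugging Face Transformers and the `deepspeed` launcher. The base-model identifier, micro-batch size, and optimizer hyperparameters are illustrative placeholders, not the authors' released configuration.

```python
# Minimal sketch: BFloat16 + FlashAttention-2 + DeepSpeed ZeRO Stage 2.
# Assumptions (not from the paper): the base-model id, micro-batch size,
# and AdamW settings below are placeholders for illustration only.
import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    "bf16": {"enabled": True},                 # BFloat16 numerical format
    "zero_optimization": {"stage": 2},         # Stage 2 for 7B/8B; set 3 for 8x7B
    "train_micro_batch_size_per_gpu": 8,       # placeholder; global batch is 512
    "gradient_accumulation_steps": 1,          # placeholder
}

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",              # placeholder base model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # FlashAttention-2 kernels
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Run under the `deepspeed` launcher so distributed initialization,
# ZeRO partitioning, and mixed precision are handled automatically.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
```

Switching to the 8x7B model would only change the ZeRO "stage" to 3; the remaining DeepSpeed options are not specified in the paper.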
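
The Warmup-Stable-Decay schedule quoted in the Experiment Setup row (3% warm-up, 85% stable, maximum LR 1e-5, minimum LR 1e-6) can be written as a simple step-to-learning-rate function. The linear ramp and linear decay shapes below are assumptions for illustration; the paper does not state the exact warm-up or decay curves.

```python
# Sketch of a Warmup-Stable-Decay (WSD) learning-rate schedule with the
# ratios and bounds quoted above. The linear warm-up and linear decay are
# assumed shapes, not taken from the paper.
def wsd_lr(step: int,
           total_steps: int,
           max_lr: float = 1e-5,
           min_lr: float = 1e-6,
           warmup_ratio: float = 0.03,
           stable_ratio: float = 0.85) -> float:
    """Return the learning rate for a given optimizer step."""
    warmup_steps = int(total_steps * warmup_ratio)
    stable_steps = int(total_steps * stable_ratio)
    decay_steps = max(total_steps - warmup_steps - stable_steps, 1)

    if step < warmup_steps:
        # Warm-up phase: linearly ramp from 0 up to max_lr.
        return max_lr * (step + 1) / max(warmup_steps, 1)
    if step < warmup_steps + stable_steps:
        # Stable phase: hold the maximum learning rate.
        return max_lr
    # Decay phase: linearly anneal from max_lr down to min_lr.
    progress = (step - warmup_steps - stable_steps) / decay_steps
    return max_lr - (max_lr - min_lr) * min(progress, 1.0)


if __name__ == "__main__":
    # Example: learning rate at a few points of a 10,000-step run.
    for s in (0, 300, 5000, 9500, 9999):
        print(s, f"{wsd_lr(s, 10_000):.2e}")
```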