DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving

Authors: Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, Junxian He

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In comprehensive in-domain and out-of-domain evaluation on 6 mathematical benchmarks, DART-Math significantly outperforms vanilla rejection tuning and is superior or comparable to previous state-of-the-art methods, despite using much smaller datasets and no proprietary models.
Researcher Affiliation | Collaboration | Yuxuan Tong (1), Xiwen Zhang (2), Rui Wang (2), Ruidong Wu (2), Junxian He (3); (1) Tsinghua University, (2) Helixon Research, (3) HKUST; tongyx21@mails.tsinghua.edu.cn, junxianh@cse.ust.hk
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our datasets, models and code are publicly available at https://github.com/hkust-nlp/dart-math. (A dataset-loading sketch follows the table.)
Open Datasets | Yes | We utilize DARS-Uniform and DARS-Prop2Diff to construct two datasets, DART-Math-Uniform and DART-Math-Hard respectively, for instruction tuning. We use the original training queries of the GSM8K (Cobbe et al., 2021a) and MATH datasets to synthesize responses. (A sketch of the difficulty-aware rejection sampling idea follows the table.)
Dataset Splits | Yes | We utilize the original training queries of the GSM8K (Cobbe et al., 2021a) and MATH datasets to synthesize responses. [...] Specifically, we use the GSM8K and MATH test sets as the in-domain test.
Hardware Specification | Yes | For 7B or 8B models, we train on 8 NVIDIA A100 GPUs. For 70B models, we train on 32 NVIDIA A100 GPUs. [...] In our setting, sampling 35k samples on MATH / GSM8K queries takes about 1 NVIDIA A100 GPU hour. [...] sampling approximately 150 million samples in total, which required running inference of DeepSeekMath-7B-RL for about 160 NVIDIA A100 GPU days. (A back-of-the-envelope check of these figures follows the table.)
Software Dependencies | No | The paper mentions software such as Transformers, vLLM, DeepSpeed, and SymPy, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Training hyperparameters, as reported (a config sketch follows the table):
Batch size: The computation sequence token length is set to 4096, considering that most sequences in the training datasets are shorter than this length. The batch size is 64, though there are usually more than 64 samples in one batch because one computation sequence can pack multiple semantic sequences. We disable gradient accumulation (Lin et al., 2018) by default, but when memory is not sufficient, we increase the number of gradient accumulation steps and keep other settings unchanged. Specifically, we use 2 gradient accumulation steps when training Llama3-8B on 8 NVIDIA A100 GPUs under our setting.
Learning rate: We use the Adam optimizer (Zhang, 2018) with the weight decay set to 0, a linear warmup with a warmup step ratio of 0.03, and a cosine learning rate scheduler. The maximum learning rates are: Mistral-7B 1e-5, DeepSeekMath-7B and Llama3-8B 5e-5, and Llama3-70B 2e-5. We determine these values by searching over {1e-6, 5e-6, 1e-5, 2e-5, 5e-5, 1e-4} according to the MATH performance after training on MMIQC for 1 epoch.
Number of training epochs: The default number of epochs is 3. For MMIQC, we train for 1 epoch, following Liu et al. (2024a). For Llama3 models, we train for 1 epoch because preliminary experiments indicate that 1 epoch consistently outperforms 3 epochs.
Prompt template: We use the format following Taori et al. (2023).
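
The Open Source Code and Open Datasets rows point to the authors' GitHub repository. Below is a minimal sketch for pulling the released training sets for inspection; the Hugging Face dataset IDs are assumptions inferred from the hkust-nlp organization name and may differ from the actual release, so the repository README should be treated as authoritative.

    # Minimal sketch: loading the released DART-Math training sets for inspection.
    # The dataset IDs below are assumptions; check https://github.com/hkust-nlp/dart-math
    # for the authoritative names and record fields.
    from datasets import load_dataset

    dart_hard = load_dataset("hkust-nlp/dart-math-hard", split="train")        # assumed ID
    dart_uniform = load_dataset("hkust-nlp/dart-math-uniform", split="train")  # assumed ID

    print(len(dart_hard), len(dart_uniform))
    print(dart_hard[0])  # expected to contain a query and a synthesized solution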
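DARS-Uniform and DARS-Prop2Diff, mentioned in the Open Datasets row, are the paper's two difficulty-aware rejection sampling strategies. The sketch below only illustrates the general idea as described in the paper (keep sampling until each query reaches a target number of correct responses, with the target either uniform across queries or growing with query difficulty); `sample_response`, `is_correct`, and `fail_rate` are hypothetical placeholders, and this is not the authors' implementation.

    def dars_collect(queries, k_max, sample_response, is_correct,
                     fail_rate=None, prop2diff=False, max_trials_per_query=2048):
        """Illustrative difficulty-aware rejection sampling loop (not the official code).

        sample_response(query) -> str : hypothetical call to the synthesis model
        is_correct(query, response)   : hypothetical answer checker
        fail_rate(query) -> [0, 1]    : hypothetical difficulty estimate (Prop2Diff only)
        """
        dataset = []
        for q in queries:
            # Target number of correct responses per query: uniform across queries
            # (DARS-Uniform) or increasing with difficulty (DARS-Prop2Diff).
            target = max(1, round(k_max * fail_rate(q))) if prop2diff else k_max

            kept, trials = [], 0
            while len(kept) < target and trials < max_trials_per_query:
                trials += 1
                response = sample_response(q)
                if is_correct(q, response):  # reject incorrect responses
                    kept.append(response)
            dataset.extend({"query": q, "response": r} for r in kept)
        return dataset

The trial cap is only there to keep the sketch from looping forever on queries the synthesis model never solves; how the paper handles such queries is not covered by the quoted text.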
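The two cost figures in the Hardware Specification row can be cross-checked with simple arithmetic: at roughly 35k samples per A100 GPU hour, about 150 million samples correspond to roughly 4,300 GPU hours, i.e. on the order of 180 GPU days, consistent with the reported ~160 A100 GPU days once rounding of both figures is accounted for.

    # Back-of-the-envelope check of the sampling cost in the Hardware Specification row.
    samples_total = 150e6          # ~150 million samples in total
    samples_per_gpu_hour = 35e3    # ~35k samples per A100 GPU hour

    gpu_hours = samples_total / samples_per_gpu_hour
    gpu_days = gpu_hours / 24
    print(f"{gpu_hours:.0f} GPU hours ~= {gpu_days:.0f} GPU days")
    # ~4286 GPU hours, i.e. ~179 GPU days -- same order of magnitude as the
    # reported ~160 NVIDIA A100 GPU days.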
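The Experiment Setup row can be condensed into a training configuration. The mapping below onto Hugging Face `TrainingArguments` is our own reconstruction for illustration, not the authors' training script; details such as sequence packing to 4096 tokens and the DeepSpeed setup live in the official repository.

    # Sketch of the reported hyperparameters as Hugging Face TrainingArguments
    # (our reconstruction; see https://github.com/hkust-nlp/dart-math for the real setup).
    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="dart-math-mistral-7b",  # hypothetical output path
        per_device_train_batch_size=8,      # 8 GPUs x 8 = global batch size 64
        gradient_accumulation_steps=1,      # 2 for Llama3-8B on 8x A100, per the paper
        learning_rate=1e-5,                 # Mistral-7B; 5e-5 for DeepSeekMath-7B/Llama3-8B, 2e-5 for Llama3-70B
        lr_scheduler_type="cosine",         # cosine schedule with linear warmup
        warmup_ratio=0.03,
        weight_decay=0.0,
        num_train_epochs=3,                 # 1 for MMIQC and for Llama3 models
        bf16=True,                          # assumption; precision is not stated in the quoted text
    )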