Iterative Reasoning Preference Optimization

Authors: Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, Jason Weston

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
Researcher Affiliation | Collaboration | Richard Yuanzhe Pang (1,2), Weizhe Yuan (1,2), Kyunghyun Cho (2), He He (2), Sainbayar Sukhbaatar (1), Jason Weston (1,2); 1 = Meta FAIR, 2 = New York University
Pseudocode | No | The paper includes Figure 1, which illustrates the iterative process, but it is a diagrammatic representation rather than a structured pseudocode or algorithm block (an illustrative sketch of the loop is provided after this table).
Open Source Code | No | This decision is still pending. However, our experiment mostly involves relatively straightforward modifications of the DPO algorithm, and all important details are included in the beginning of each subsection in Experiments. Prompts are also included in the appendix. Given these details, it should be straightforward to replicate the experiments.
Open Datasets | Yes | We have confirmed that the licenses of the datasets used in this paper (MIT for GSM8K and MATH, CC BY-SA 4.0 for ARC) are respected.
Dataset Splits | Yes | For each iteration, we train a maximum of 5000 steps, and then select the best checkpoint using a held-out set of 1k samples from the training set.
Hardware Specification | Yes | Throughout this paper, all generation is done using one node containing eight V100 GPUs (32 GB memory). ... All training is done using eight nodes, each containing eight A100 GPUs (80 GB memory).
Software Dependencies | No | The paper mentions using 'vLLM' for inference, but does not specify its version number or any other software dependencies with their respective versions (e.g., Python, PyTorch, or specific optimizer library versions).
Experiment Setup | Yes | As a seed model M0 we use the chat version of the Llama-2 70B model... N = 30 solutions per problem using sampling with temperature 0.8 for iterations 1-2 and temperature 1.3 for iterations 3-4... Then we generate K = 10 pairs per problem for training... For each iteration, we train a maximum of 5000 steps... The coefficient α is tuned in {0.25, 0.5, 1, 2} when training M1, and we end up using 1 for all experiments in the paper. The coefficient β in the DPO loss is tuned in {0.05, 0.1, 0.5, 1.0}, and we end up using 0.1 in this experiment. We use a batch size of 16 and a learning rate of 7e-7 with the AdamW optimizer.
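Since the paper provides neither pseudocode nor released code (see the Pseudocode and Open Source Code rows above), the following is a minimal, hypothetical Python sketch of the iterative loop as described in the quoted experiment details: sample N = 30 chain-of-thought solutions per training problem, label them by final-answer match, build up to K = 10 (correct, incorrect) preference pairs per problem, and train the next model with a DPO+NLL objective. The helpers generate_fn, train_dpo_nll_fn, and extract_answer are placeholder callables, not the authors' code, and initialization/data-accumulation details are simplified.

```python
# Hypothetical sketch of the iterative preference-optimization loop, based only on
# the details quoted in this report. All names are placeholders, not the authors' code.
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    prompt: str    # question plus the chain-of-thought few-shot prompt
    chosen: str    # sampled CoT whose final answer matches the gold label
    rejected: str  # sampled CoT whose final answer does not match

def build_pairs(prompt: str, samples: list[str], gold_answer: str,
                extract_answer: Callable[[str], str], k: int,
                rng: random.Random) -> list[PreferencePair]:
    """Split sampled solutions by answer correctness and form up to k pairs."""
    correct = [s for s in samples if extract_answer(s) == gold_answer]
    wrong = [s for s in samples if extract_answer(s) != gold_answer]
    if not correct or not wrong:
        return []  # problems without both a correct and an incorrect sample are skipped
    return [PreferencePair(prompt, rng.choice(correct), rng.choice(wrong))
            for _ in range(k)]

def iterative_rpo(train_set: list[tuple[str, str]],  # (prompt, gold answer) pairs
                  generate_fn: Callable[[object, str, int, float], list[str]],
                  train_dpo_nll_fn: Callable[[object, list[PreferencePair]], object],
                  extract_answer: Callable[[str], str],
                  seed_model: object,
                  num_iterations: int = 4,
                  n_samples: int = 30,
                  k_pairs: int = 10) -> object:
    """Generate -> label -> pair -> train, repeated for a few iterations."""
    rng = random.Random(0)
    model = seed_model
    for it in range(num_iterations):
        # The report quotes temperature 0.8 for iterations 1-2 and 1.3 for iterations 3-4.
        temperature = 0.8 if it < 2 else 1.3
        pairs: list[PreferencePair] = []
        for prompt, gold in train_set:
            samples = generate_fn(model, prompt, n_samples, temperature)
            pairs.extend(build_pairs(prompt, samples, gold, extract_answer, k_pairs, rng))
        # Simplification: the current model also serves as the DPO reference model here;
        # see the paper for the exact initialization and data-accumulation scheme.
        model = train_dpo_nll_fn(model, pairs)
    return model
```

The Experiment Setup row reports a DPO loss combined with an NLL term, with β = 0.1 on the DPO logits and α = 1 on the NLL term. Below is a minimal PyTorch sketch of such a combined objective, assuming the caller has already computed per-token log-probabilities for the chosen and rejected responses under the policy and a frozen reference model; the tensor layout and the length normalization of the NLL term are assumptions for illustration, not the authors' exact implementation.

```python
# Hedged sketch of a DPO + NLL objective with the reported coefficients
# (beta = 0.1, alpha = 1.0). Not the authors' code.
import torch
import torch.nn.functional as F

def dpo_nll_loss(policy_chosen_logps: torch.Tensor,    # (batch, seq) token log-probs, 0 on padding
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 chosen_mask: torch.Tensor,             # (batch, seq), 1 on response tokens
                 rejected_mask: torch.Tensor,
                 beta: float = 0.1,
                 alpha: float = 1.0) -> torch.Tensor:
    # Sequence-level log-probabilities: sum over response tokens.
    pol_c = (policy_chosen_logps * chosen_mask).sum(-1)
    pol_r = (policy_rejected_logps * rejected_mask).sum(-1)
    ref_c = (ref_chosen_logps * chosen_mask).sum(-1)
    ref_r = (ref_rejected_logps * rejected_mask).sum(-1)

    # Standard DPO term: -log sigmoid(beta * policy-vs-reference margin).
    margin = (pol_c - ref_c) - (pol_r - ref_r)
    dpo_term = -F.logsigmoid(beta * margin)

    # NLL term on the chosen (correct) sequence, length-normalized (assumption).
    nll_term = -pol_c / chosen_mask.sum(-1).clamp(min=1)

    return (dpo_term + alpha * nll_term).mean()
```

Under the quoted setup, this loss would be optimized with AdamW, batch size 16, and learning rate 7e-7 for at most 5000 steps per iteration, with the best checkpoint selected on the held-out 1k training examples.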