SimPO: Simple Preference Optimization with a Reference-Free Reward

Authors: Yu Meng, Mengzhou Xia, Danqi Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare SimPO to DPO and its recent variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length.
Researcher Affiliation | Academia | Yu Meng (1), Mengzhou Xia (2), Danqi Chen (2); (1) Computer Science Department, University of Virginia; (2) Princeton Language and Intelligence (PLI), Princeton University; yumeng5@virginia.edu, {mengzhou,danqic}@cs.princeton.edu
Pseudocode | No | The paper includes equations and figures, but no explicit pseudocode or algorithm blocks (a sketch of the objective written from those equations is given after this table).
Open Source Code | Yes | Code and models can be found at https://github.com/princeton-nlp/SimPO.
Open Datasets | Yes | First, we train a base model (i.e., mistralai/Mistral-7B-v0.1, or meta-llama/Meta-Llama-3-8B) on the UltraChat-200k dataset [25] to obtain an SFT model. Then, we perform preference optimization on the UltraFeedback dataset [23] using the SFT model as the starting point. (A minimal data-loading sketch is given after this table.)
Dataset Splits | No | The paper mentions using a 'held-out validation set' for analysis and conducting 'preliminary experiments to search for batch sizes', which implies a validation process. However, it does not explicitly provide train/validation/test splits with percentages or sample counts for the datasets used (UltraChat-200k or UltraFeedback).
Hardware Specification | Yes | All the training experiments in this paper were conducted on 8 H100 GPUs based on the alignment-handbook repo.
Software Dependencies | No | The paper mentions using an 'Adam optimizer [43]' and refers to the 'alignment-handbook repo', but it does not specify version numbers for multiple key software components or a self-contained solver with a specific version number.
Experiment Setup | Yes | For the Base training setups, we train SFT models using the UltraChat-200k dataset [25] with the following hyperparameters: a learning rate of 2e-5, a batch size of 128, a max sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for 1 epoch. For the preference optimization stage, we conduct preliminary experiments to search for batch sizes in [32, 64, 128] and training epochs in [1, 2, 3]. We find that a batch size of 128 and a single training epoch generally yield the best results across all methods. Additionally, we set the max sequence length to be 2048 and apply a cosine learning rate schedule with 10% warmup steps on the preference optimization dataset. Table 8 shows SimPO's hyperparameters used under each setting. (A hedged configuration sketch is given after this table.)
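
Since the paper provides its method only as equations (see the Pseudocode row), the following is a minimal PyTorch sketch of the SimPO objective as described there: a Bradley-Terry style loss over a reference-free, length-normalized reward, with beta scaling the reward and gamma acting as a target reward margin. The function name, tensor conventions, and the default beta/gamma values are illustrative assumptions, not values taken from the paper's Table 8.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_lengths: torch.Tensor,
               rejected_lengths: torch.Tensor,
               beta: float = 2.0,
               gamma: float = 1.0) -> torch.Tensor:
    """Sketch of the SimPO objective (names and defaults are illustrative).

    chosen_logps / rejected_logps: summed token log-probabilities of the
    preferred / dispreferred responses under the policy model.
    chosen_lengths / rejected_lengths: response lengths |y| used for the
    length-normalized, reference-free reward r(x, y) = (beta / |y|) * log pi(y | x).
    """
    chosen_reward = beta * chosen_logps / chosen_lengths
    rejected_reward = beta * rejected_logps / rejected_lengths
    # Bradley-Terry style pairwise loss with a target reward margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```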
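
For the two public datasets named in the Open Datasets row, a minimal loading sketch with the Hugging Face datasets library is shown below. The Hub identifiers and split names are assumptions based on the commonly used public releases; the paper itself only names UltraChat-200k and UltraFeedback.

```python
from datasets import load_dataset

# Hub identifiers and split names are assumptions based on the public
# HuggingFaceH4 releases; the report only names UltraChat-200k and UltraFeedback.
sft_data = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
pref_data = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

print(sft_data)   # chat transcripts used for supervised fine-tuning
print(pref_data)  # (prompt, chosen, rejected) preference pairs
```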
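
The SFT hyperparameters quoted in the Experiment Setup row map naturally onto a Hugging Face TrainingArguments configuration; the sketch below is one possible rendering under stated assumptions, not the authors' actual config. The per-device batch size, precision, and output directory are guesses (the paper reports only the effective batch size of 128 and 8 H100 GPUs), and the 2048-token max sequence length is typically passed to the trainer or tokenizer rather than to TrainingArguments.

```python
from transformers import TrainingArguments

# Assumed rendering of the reported SFT hyperparameters; per-device batch size
# and bf16 precision are guesses chosen to reach the reported effective batch
# size of 128 on 8 GPUs.
sft_args = TrainingArguments(
    output_dir="outputs/mistral-7b-sft",   # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=16,        # 16 x 8 GPUs = 128 effective
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                      # 10% warmup steps
    bf16=True,                             # assumed precision, not stated in the excerpt
)
```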