SimPO: Simple Preference Optimization with a Reference-Free Reward
Authors: Yu Meng, Mengzhou Xia, Danqi Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare SimPO to DPO and its recent variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. |
| Researcher Affiliation | Academia | Yu Meng (Computer Science Department, University of Virginia); Mengzhou Xia, Danqi Chen (Princeton Language and Intelligence (PLI), Princeton University); yumeng5@virginia.edu, {mengzhou,danqic}@cs.princeton.edu |
| Pseudocode | No | The paper includes equations and figures, but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models can be found at https://github.com/princeton-nlp/SimPO. |
| Open Datasets | Yes | First, we train a base model (i.e., mistralai/Mistral-7B-v0.1 or meta-llama/Meta-Llama-3-8B) on the UltraChat-200k dataset [25] to obtain an SFT model. Then, we perform preference optimization on the UltraFeedback dataset [23] using the SFT model as the starting point. (A dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions using a 'held-out validation set' for analysis and conducting 'preliminary experiments to search for batch sizes', which implies a validation process, but it does not explicitly provide train/validation/test splits (percentages or sample counts) for the datasets used (UltraChat-200k or UltraFeedback). |
| Hardware Specification | Yes | All the training experiments in this paper were conducted on 8 H100 GPUs based on the alignment-handbook repo. |
| Software Dependencies | No | The paper mentions using an 'Adam optimizer [43]' and refers to the 'alignment-handbook repo', but it does not specify version numbers for multiple key software components or a self-contained solver with a specific version number. |
| Experiment Setup | Yes | For the Base training setups, we train SFT models using the UltraChat-200k dataset [25] with the following hyperparameters: a learning rate of 2e-5, a batch size of 128, a max sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for 1 epoch. For the preference optimization stage, we conduct preliminary experiments to search for batch sizes in [32, 64, 128] and training epochs in [1, 2, 3]. We find that a batch size of 128 and a single training epoch generally yield the best results across all methods. Additionally, we set the max sequence length to be 2048 and apply a cosine learning rate schedule with 10% warmup steps on the preference optimization dataset. Table 8 shows SimPO's hyperparameters used under each setting. (A configuration sketch follows the table.) |
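The Open Datasets row describes a two-stage pipeline: SFT on UltraChat-200k, then preference optimization on UltraFeedback. Below is a minimal sketch of loading both datasets with the Hugging Face `datasets` library; the Hub IDs (`HuggingFaceH4/ultrachat_200k`, `HuggingFaceH4/ultrafeedback_binarized`) and split names are assumptions based on the copies commonly used with the alignment-handbook repo, not details stated in the table.

```python
# Sketch: load the two public datasets used in the paper's training pipeline.
# Hub IDs and split names are assumptions (see the lead-in above).
from datasets import load_dataset

# Stage 1: SFT data (UltraChat-200k); "train_sft" is the assumed split name.
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# Stage 2: preference data (UltraFeedback, binarized into chosen/rejected pairs);
# "train_prefs" is the assumed split name.
ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

print(len(ultrachat), ultrachat.column_names)
print(len(ultrafeedback), ultrafeedback.column_names)
```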
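The Experiment Setup row gives concrete SFT hyperparameters (learning rate 2e-5, global batch size 128, max sequence length 2048, cosine schedule with 10% warmup, 1 epoch). A minimal sketch of how they map onto Hugging Face `TrainingArguments` follows; the per-device/accumulation split of the global batch size (16 per device across the 8 H100s from the Hardware row) and the use of bf16 are assumptions, and the 2048-token limit is applied at tokenization or by the SFT trainer rather than through `TrainingArguments`.

```python
# Sketch: the reported SFT hyperparameters expressed as Hugging Face TrainingArguments.
# Batch-size split and bf16 are assumptions; output_dir is hypothetical.
from transformers import TrainingArguments

MAX_SEQ_LENGTH = 2048  # reported max sequence length; enforced when tokenizing, not here

sft_args = TrainingArguments(
    output_dir="sft-ultrachat-200k",    # hypothetical output path
    learning_rate=2e-5,                 # reported SFT learning rate
    per_device_train_batch_size=16,     # 16 x 8 GPUs = reported global batch size of 128
    gradient_accumulation_steps=1,
    num_train_epochs=1,                 # reported: 1 epoch
    lr_scheduler_type="cosine",         # reported cosine schedule
    warmup_ratio=0.1,                   # reported 10% warmup steps
    bf16=True,                          # assumption; typical on H100 GPUs
    logging_steps=10,
)
```

Per the quote, the preference-optimization stage reuses a batch size of 128, one epoch, a 2048-token limit, and the same cosine/10%-warmup schedule; the SimPO-specific hyperparameters (Table 8 of the paper) are not reproduced in this table.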