Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SimPO: Simple Preference Optimization with a Reference-Free Reward
Authors: Yu Meng, Mengzhou Xia, Danqi Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare SimPO to DPO and its recent variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. |
| Researcher Affiliation | Academia | Yu Meng¹, Mengzhou Xia², Danqi Chen². ¹Computer Science Department, University of Virginia; ²Princeton Language and Intelligence (PLI), Princeton University. |
| Pseudocode | No | The paper includes equations and figures, but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models can be found at https://github.com/princeton-nlp/SimPO. |
| Open Datasets | Yes | First, we train a base model (i.e., mistralai/Mistral-7B-v0.1, or meta-llama/Meta-Llama-3-8B) on the UltraChat-200k dataset [25] to obtain an SFT model. Then, we perform preference optimization on the UltraFeedback dataset [23] using the SFT model as the starting point. |
| Dataset Splits | No | The paper mentions using a 'held-out validation set' for analysis and conducting 'preliminary experiments to search for batch sizes', which implies a validation process. However, it does not explicitly provide specific train/validation/test dataset splits with percentages or sample counts for the datasets used (UltraChat-200k or UltraFeedback). |
| Hardware Specification | Yes | All the training experiments in this paper were conducted on 8 H100 GPUs based on the alignment-handbook repo. |
| Software Dependencies | No | The paper mentions using an 'Adam optimizer [43]' and refers to the 'alignment-handbook repo', but it does not specify version numbers for these or any other key software components. |
| Experiment Setup | Yes | For the Base training setups, we train SFT models using the UltraChat-200k dataset [25] with the following hyperparameters: a learning rate of 2e-5, a batch size of 128, a max sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for 1 epoch. For the preference optimization stage, we conduct preliminary experiments to search for batch sizes in [32, 64, 128] and training epochs in [1, 2, 3]. We find that a batch size of 128 and a single training epoch generally yield the best results across all methods. Additionally, we set the max sequence length to be 2048 and apply a cosine learning rate schedule with 10% warmup steps on the preference optimization dataset. Table 8 shows SimPO's hyperparameters used under each setting. |
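
The SFT hyperparameters quoted in the Experiment Setup row can be collected into a small config, and the "cosine learning rate schedule with 10% warmup steps" can be sketched as a plain function. This is an illustrative sketch only, not the authors' code. The names `SFTConfig` and `cosine_lr` are hypothetical, and the linear warmup shape, the decay to zero, and the total of 1000 steps are assumptions that the quoted text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTConfig:
    """SFT values quoted in the Experiment Setup row (Base setup)."""
    learning_rate: float = 2e-5   # peak learning rate
    batch_size: int = 128
    max_seq_length: int = 2048
    warmup_ratio: float = 0.10    # 10% of steps used for warmup
    num_epochs: int = 1


def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = SFTConfig()
    total_steps = 1000  # hypothetical; the real count depends on dataset size
    for step in (0, 50, 100, 550, 1000):
        lr = cosine_lr(step, total_steps, cfg.learning_rate, cfg.warmup_ratio)
        print(f"step {step:4d}: lr = {lr:.3e}")
```

Under these assumptions the rate climbs during the first 10% of steps, peaks at 2e-5, and falls along a half cosine for the remainder. The preference optimization stage uses the same schedule shape with the per-setting values from the paper's Table 8, which are not reproduced here.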