Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ComPO: Preference Alignment via Comparison Oracles
Authors: Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing methods, not only likelihood displacement but verbosity. We conduct extensive experiments demonstrating the flexibility and effectiveness of our practical approach in improving LLM performance, particularly leveraging both clean and noisy preference data. |
| Researcher Affiliation | Collaboration | Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin; Columbia University; Stern School of Business, New York University; DAMO Academy, Alibaba Group US; EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Comparison-Based Preference Alignment (Basic Scheme) Algorithm 2 Comparison-Based Preference Alignment (Practical Scheme) |
| Open Source Code | Yes | Models and code: huggingface.co/ComparisonPO github.com/PeterLauLukChen/ComparisonPO |
| Open Datasets | Yes | We use the UltraFeedback dataset [23] to train DPO and ComPO. For the DPO experiment from Table 1, we use the datasets from trl-lib (https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized). For the SimPO experiment from Table 2, we follow the setup of [60] and use the datasets from HuggingFaceH4 (https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized). |
| Dataset Splits | No | The paper states: "We split the samples using δ = 3." and "We use the UltraFeedback dataset from trl-lib and split the preference data into clean and noisy subsets using the margin criterion from Eq. (3)." This describes a specific criterion for splitting data into 'clean' and 'noisy' subsets, but it does not specify the typical training, validation, or test dataset splits (e.g., 80/10/10 percentages or sample counts) for the overall dataset used in the experiments. It relies on external benchmarks for evaluation, but does not provide details on how the UltraFeedback dataset itself was split for their model training. |
| Hardware Specification | Yes | All the experiments are implemented in Python 3.10 with PyTorch 2.5.1 with 30 NVIDIA A40 GPUs each with 46 GB memory, equipped with Ubuntu 22.04.5 LTS. |
| Software Dependencies | Yes | All the experiments are implemented in Python 3.10 with PyTorch 2.5.1 with 30 NVIDIA A40 GPUs each with 46 GB memory, equipped with Ubuntu 22.04.5 LTS. |
| Experiment Setup | Yes | We split the samples using δ = 3. For Mistral-7B models, we set r = 0.0005, m = 1600, λg = 0.00022 and λ = 0.2. For Llama-3-8B models and Gemma-2-it-9B model, we set r = 0.00075, m = 1800, λg = 0.00008 and λ = 0.2. For the detailed information on datasets, models, and evaluation benchmarks, we defer to Appendix D. All the experiments are implemented in Python 3.10 with PyTorch 2.5.1 with 30 NVIDIA A40 GPUs each with 46 GB memory, equipped with Ubuntu 22.04.5 LTS. |
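
The hyperparameters quoted in the Experiment Setup row can be collected into a small lookup table for anyone attempting a reproduction. The sketch below is illustrative only: the class name `ComPOSetup`, the field names, the dictionary keys, and the helper `setup_for` are hypothetical and do not come from the authors' code. It also assumes that δ = 3 applies to every model family, since the row states it before the per-model values. The meaning of each symbol is defined in the paper, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComPOSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    delta: float     # δ, threshold used to split the samples
    r: float         # r, as reported
    m: int           # m, as reported
    lambda_g: float  # λg, as reported
    lam: float       # λ, as reported


# Assumption: δ = 3 is shared by all model families.
SETUPS = {
    "Mistral-7B": ComPOSetup(delta=3, r=0.0005, m=1600, lambda_g=0.00022, lam=0.2),
    "Llama-3-8B": ComPOSetup(delta=3, r=0.00075, m=1800, lambda_g=0.00008, lam=0.2),
    "Gemma-2-it-9B": ComPOSetup(delta=3, r=0.00075, m=1800, lambda_g=0.00008, lam=0.2),
}


def setup_for(model_family: str) -> ComPOSetup:
    """Return the reported setup for a model family, or raise KeyError."""
    try:
        return SETUPS[model_family]
    except KeyError:
        known = ", ".join(sorted(SETUPS))
        raise KeyError(f"No reported setup for {model_family!r}; known: {known}") from None


if __name__ == "__main__":
    for family, setup in SETUPS.items():
        print(family, setup)
```

Keeping the values in one frozen structure makes it harder to mix settings between model families by accident. The remaining details (datasets, models, evaluation benchmarks) are deferred to Appendix D of the paper and are not reproduced here.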