Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

ComPO: Preference Alignment via Comparison Oracles

Authors: Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations are conducted across multiple base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show the effectiveness of our method as an alternative to addressing the limitations of existing methods, not only likelihood displacement but verbosity. We conduct extensive experiments demonstrating the flexibility and effectiveness of our practical approach in improving LLM performance, particularly leveraging both clean and noisy preference data.
Researcher Affiliation	Collaboration	Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin. Columbia University; Stern School of Business, New York University; DAMO Academy, Alibaba Group US. EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Comparison-Based Preference Alignment (Basic Scheme); Algorithm 2: Comparison-Based Preference Alignment (Practical Scheme)
Open Source Code	Yes	Models and code: huggingface.co/ComparisonPO, github.com/PeterLauLukChen/ComparisonPO
Open Datasets	Yes	We use the UltraFeedback dataset [23] to train DPO and ComPO. For the DPO experiment from Table 1, we use the datasets from trl-lib (https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized). For the SimPO experiment from Table 2, we follow the setup of [60] and use the datasets from HuggingFaceH4 (https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).
Dataset Splits	No	The paper states: "We split the samples using δ = 3." and "We use the UltraFeedback dataset from trl-lib and split the preference data into clean and noisy subsets using the margin criterion from Eq. (3)." This describes a criterion for splitting data into 'clean' and 'noisy' subsets, but it does not specify training, validation, or test splits (e.g., 80/10/10 percentages or sample counts) for the overall dataset used in the experiments. The paper relies on external benchmarks for evaluation and does not detail how the UltraFeedback dataset itself was split for model training.
Hardware Specification	Yes	All the experiments are implemented in Python 3.10 with PyTorch 2.5.1 with 30 NVIDIA A40 GPUs each with 46 GB memory, equipped with Ubuntu 22.04.5 LTS.
Software Dependencies	Yes	All the experiments are implemented in Python 3.10 with PyTorch 2.5.1 with 30 NVIDIA A40 GPUs each with 46 GB memory, equipped with Ubuntu 22.04.5 LTS.
Experiment Setup	Yes	We split the samples using δ = 3. For Mistral-7B models, we set r = 0.0005, m = 1600, λg = 0.00022 and λ = 0.2. For Llama-3-8B models and Gemma-2-it-9B model, we set r = 0.00075, m = 1800, λg = 0.00008 and λ = 0.2. For the detailed information on datasets, models, and evaluation benchmarks, we defer to Appendix D. All the experiments are implemented in Python 3.10 with PyTorch 2.5.1 with 30 NVIDIA A40 GPUs each with 46 GB memory, equipped with Ubuntu 22.04.5 LTS.
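
For readers who want to track the reported settings programmatically, the values quoted in the Experiment Setup row can be written down as a small configuration. This is only a sketch: the class name `ComPOSetup`, the field names, the dictionary keys, and the helper `split_by_margin` are hypothetical and do not come from the authors' code. The page does not reproduce Eq. (3), so the helper takes precomputed margins and assumes that a margin of at least δ marks a sample as clean; the paper's actual criterion may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComPOSetup:
    """Hyperparameters quoted on this page; field names are illustrative."""
    r: float             # "r" as reported
    m: int               # "m" as reported
    lambda_g: float      # "λg" as reported
    lam: float           # "λ" as reported
    delta: float = 3.0   # "δ" used to split the samples


# Keys are hypothetical labels, not identifiers from the authors' repository.
SETUPS = {
    "mistral-7b": ComPOSetup(r=0.0005, m=1600, lambda_g=0.00022, lam=0.2),
    "llama-3-8b": ComPOSetup(r=0.00075, m=1800, lambda_g=0.00008, lam=0.2),
    "gemma-2-it-9b": ComPOSetup(r=0.00075, m=1800, lambda_g=0.00008, lam=0.2),
}


def split_by_margin(margins, delta):
    """Partition sample indices into (clean, noisy) by a precomputed margin.

    ASSUMPTION: margin >= delta means 'clean' and margin < delta means
    'noisy'. How the margin is computed (Eq. (3) in the paper) is not
    shown on this page, so it is left to the caller.
    """
    clean = [i for i, value in enumerate(margins) if value >= delta]
    noisy = [i for i, value in enumerate(margins) if value < delta]
    return clean, noisy
```

A call such as `split_by_margin(margins, SETUPS["mistral-7b"].delta)` would then return two index lists. The frozen dataclass keeps the recorded values immutable, which suits a record of published settings.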