Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

RePO: Understanding Preference Learning Through ReLU-Based Optimization

Authors: Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that over-optimization does exist, but a threshold parameter γ plays an essential role in preventing it by dynamically filtering training examples. We further provide theoretical analysis demonstrating that ReLU-based Preference Optimization (RePO) corresponds to the convex envelope of the 0-1 loss, establishing its fundamental soundness. Our RePO method achieves competitive or superior results compared to established preference optimization approaches.
Researcher Affiliation | Collaboration | 1 University of Science and Technology of China; 2 Alibaba Group; 3 Institute of Dataspace, Hefei Comprehensive National Science Center; 4 MoE Key Lab of BIPC, University of Science and Technology of China
Pseudocode | No | The paper contains mathematical equations (e.g., Equation 6 for L_RePO(π_θ)) and descriptive text explaining the method, but no dedicated pseudocode or algorithm blocks are present.
Open Source Code | Yes | The code is available at https://github.com/junkangwu/RePO.
Open Datasets | Yes | For consistency, we use the same training datasets as SimPO: princeton-nlp/llama3-ultrafeedback-armorm for Llama3-8B and princeton-nlp/gemma2-ultrafeedback-armorm for Gemma2-9B.
Dataset Splits | No | The paper mentions using training datasets (e.g., princeton-nlp/llama3-ultrafeedback-armorm) and evaluating on benchmarks such as AlpacaEval 2 and Arena-Hard, but it does not explicitly describe how the training data was split into training, validation, or test sets for reproduction, beyond mentioning single-epoch training and held-out development sets for hyperparameter tuning.
Hardware Specification | Yes | All training experiments described in this paper were conducted using 8 A100 GPUs.
Software Dependencies | No | The paper mentions the use of the Adam optimizer, Llama3-8B and Gemma2-9B models, and full-precision floating-point arithmetic. However, it does not specify explicit version numbers for key software components such as Python, PyTorch, or CUDA, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | Batch size: 128 (consistent across methods); Learning rate: searched in {3e-7, 5e-7, 8e-7, 1e-6}; Training duration: single epoch with cosine annealing schedule; Warmup: 10% of total training steps; Optimizer: Adam [13] (β1 = 0.9, β2 = 0.999); Sequence length: 2048 tokens (fixed for all inputs). Table 4 (the hyperparameter values in RePO used for each training setting), listed as setting: γ, learning rate: Mistral-Instruct: 0.4, 6e-7; Llama3-Instruct: 0.6, 1e-6; Llama3-Instruct-v0.2: 0.6, 1e-6; Gemma2-Instruct: 0.4, 8e-7.
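
The setup quoted in the table can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation. The function and variable names (`relu_margin_loss`, `lr_at_step`, `REPO_SETTINGS`) are hypothetical. The loss is assumed to take the hinge-style form ReLU(γ - m) on a per-pair preference margin m; the exact definition is the paper's Equation 6, which is not reproduced in this report. The linear warmup shape and the decay to zero are also assumptions, since the report states only "cosine annealing" and a 10% warmup.

```python
import math

# Per-setting values as quoted from Table 4 in the Experiment Setup row.
REPO_SETTINGS = {
    "Mistral-Instruct": {"gamma": 0.4, "lr": 6e-7},
    "Llama3-Instruct": {"gamma": 0.6, "lr": 1e-6},
    "Llama3-Instruct-v0.2": {"gamma": 0.6, "lr": 1e-6},
    "Gemma2-Instruct": {"gamma": 0.4, "lr": 8e-7},
}


def relu_margin_loss(margin: float, gamma: float) -> float:
    """Assumed hinge-style form ReLU(gamma - margin).

    A pair whose margin already reaches gamma contributes zero loss
    (and therefore zero gradient), which is one way a threshold can
    filter training examples dynamically.
    """
    return max(0.0, gamma - margin)


def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_frac: float = 0.10) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumption: linear warmup over the first warmup_frac of steps,
    then cosine annealing from peak_lr down to zero.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = REPO_SETTINGS["Llama3-Instruct"]
    # A pair with margin 0.2 is still below gamma = 0.6 and is penalized;
    # a pair with margin 0.7 is filtered out (zero loss).
    print(relu_margin_loss(0.2, cfg["gamma"]), relu_margin_loss(0.7, cfg["gamma"]))
    # The total step count of 1000 is an arbitrary illustrative value.
    print(lr_at_step(0, 1000, cfg["lr"]), lr_at_step(500, 1000, cfg["lr"]))
```

Keeping the schedule as a pure function of the step index makes it easy to check against whichever scheduler a training framework actually provides.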