Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Less is More: Improving LLM Alignment via Preference Data Selection
Authors: Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We organize the experiments as follows: we explain the experimental setup in Section 4.1; we compare BeeS with various sample selection baselines on diverse preference tasks and present the detailed results in Section 4.2; then we focus on the important chat task, and explore the effectiveness of BeeS in enhancing comprehensive dialogue ability in Section 4.3. Lastly, we perform diverse ablation studies for BeeS in Section 4.4. |
| Researcher Affiliation | Collaboration | 1 University of Science and Technology of China, 2 Peking University, 3 Massachusetts Institute of Technology, 4 Alibaba Cloud Computing, 5 MoE Key Lab of BIPC, University of Science and Technology of China |
| Pseudocode | No | The paper includes a workflow diagram (Figure 1) but does not provide explicit pseudocode or algorithm blocks. Methods are described in paragraph text. |
| Open Source Code | Yes | We provide source code of our paper in https://github.com/xiangtanshi/DPO-Data-Selection. |
| Open Datasets | Yes | We evaluate our approach using three established preference datasets: (1) Reddit TL;DR summarization dataset [47, 42] that contains human-written summaries and human-rated results, (2) Anthropic Helpful and Harmless dialogue dataset (HH) [2], and (3) UltraFeedback [7], which comprises quality-scored model responses across diverse prompts from multiple sources. |
| Dataset Splits | Yes | For the offline data selection setting, we compare our method with three types of methods: (1) Random, a simple yet effective strategy in many domains (e.g., Instruction tuning [52]), (2) IFD [26] (i.e., exponential form of the Point-wise Mutual Information), which measures semantic overlap. We use the difference in IFD scores between chosen and rejected responses for preference data selection. (3) External/Implicit Margin (M-Ex/Im) computes the gap between chosen and rejected responses using either external reward models or implicit DPO rewards. For (2) and (3), we segment the data into P (most positive pairs), Z (close to zero pairs), and N (most negative pairs) subsets according to margin values. Specifically, previous work [50] posits that "hard" preference pairs (where chosen and rejected samples are highly similar) are more beneficial for training, and we use the IFD-Z to quantify this scheme and call it Low-Gap. For the iterative DPO setting, we compare our approach against the standard online iterative DPO baseline established by [53, 12] and run for three rounds, each using 20k prompts sampled from UltraFeedback. |
| Hardware Specification | Yes | For all experiments, we utilized 8 A100 GPUs. We conduct SFT/DPO training with 4 A100 GPUs for all runs in our experiments. |
| Software Dependencies | No | The paper mentions using the TRL repo for DPO experiments but does not provide specific version numbers for this or other software libraries/frameworks. |
| Experiment Setup | Yes | For DPO training, we follow [39] and use a fixed value of β = 0.1, except for TL;DR where β = 0.5. We run each training for two epochs, with a learning rate of 5 × 10⁻⁷, and a 0.1 warmup ratio. Following [39], we evaluate the models using 400 randomly sampled test sets from the validation/test pools of the TL;DR and HH datasets, separately. For models trained on UltraFeedback, we employ AlpacaEval and AlpacaEval 2.0 [27] as our evaluation benchmark, which consists of 805 diverse questions. As the ground truth oracle is unavailable, we employ GPT-4 as a proxy for human judgment across three distinct settings: summarization, helpful or harmless completion, and single-turn dialogue. We utilized a fixed decoding temperature (T = 0.7) for all model generation in the experiments. More details are presented in Appendix A. The SFT training of the Base model is carried out for two epochs with a learning rate of 2 × 10⁻⁵. Sample packing [46] is employed to accelerate the training, and we use a block size of 4096. |
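
The Dataset Splits response above describes segmenting preference pairs into P (most positive), Z (close to zero), and N (most negative) subsets according to margin values. The following is a minimal sketch of one way such a split could be computed, not the authors' implementation: the function name `segment_by_margin`, the subset size `k`, and the toy margin values are all hypothetical, and the paper's actual subset sizes and tie-breaking rules are not stated in this summary.

```python
from typing import Dict, List, Sequence


def segment_by_margin(margins: Sequence[float], k: int) -> Dict[str, List[int]]:
    """Return indices of the k most positive (P), k closest-to-zero (Z),
    and k most negative (N) margins.

    `margins[i]` is assumed to be score(chosen_i) - score(rejected_i);
    `k` is a hypothetical subset size chosen by the caller.
    """
    if not 0 < k <= len(margins):
        raise ValueError("k must be between 1 and len(margins)")
    indices = range(len(margins))
    by_value = sorted(indices, key=lambda i: margins[i])    # ascending margin
    by_abs = sorted(indices, key=lambda i: abs(margins[i]))  # ascending |margin|
    return {
        "P": by_value[-k:][::-1],  # largest margins first
        "Z": by_abs[:k],           # smallest |margin| first
        "N": by_value[:k],         # most negative margins first
    }


# Toy margins, made up purely for illustration.
margins = [2.5, -0.1, 0.05, -3.0, 1.2, -1.7]
subsets = segment_by_margin(margins, k=2)
print(subsets)
```

With small datasets and large `k` the three subsets can overlap; a real pipeline would need to decide whether that is acceptable.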