Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Adaptive Batch-Wise Sample Scheduling for Direct Preference Optimization
Authors: Zixuan Huang, Yikun Ban, Lean Fu, Xiaojie Li, Zhongxiang Dai, Jianxin Li, Deqing Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments across diverse benchmarks, including AlpacaEval 2 [29] and MT-Bench [100], to evaluate the effectiveness of SamS. Notably, when integrated with the original DPO loss, SamS consistently outperforms several advanced offline preference optimization methods on mainstream evaluation benchmarks. Particularly, our method improves the AlpacaEval 2 win rate (WR) by 3.0%-12.4% and the length-controlled win rate (LC) by 5.5%-8.4% compared to the baselines. Furthermore, we conduct a thorough evaluation of SamS under noisy preference data conditions and show that its integration significantly enhances robustness against label noise. |
| Researcher Affiliation | Collaboration | (1) Beihang University; (2) Bytedance Inc; (3) The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1 Proposed Algorithm: SamS |
| Open Source Code | Yes | The code is available at https://github.com/hzx122/SamS. |
| Open Datasets | Yes | Detailed information about the datasets used in the experiments is presented in Table 5. For HH and SHP, we directly utilize the open-source data available on Hugging Face. For UltraFeedback, to ensure that the chosen responses in the training samples during preference optimization are in-distribution, we use only the prompts from the dataset and generate the offline preference dataset following the approach described in Appendix D.1. |
| Dataset Splits | Yes | Detailed information about the datasets used in the experiments is presented in Table 5. For HH and SHP, we directly utilize the open-source data available on Hugging Face. For UltraFeedback, to ensure that the chosen responses in the training samples during preference optimization are in-distribution, we use only the prompts from the dataset and generate the offline preference dataset following the approach described in Appendix D.1. Table 5: Statistical information about the training datasets used in the experiments. Columns: dataset, \|D_train\|, \|D_test\|, type. HH: 160800, 8552, Helpful & Harmless; SHP: 348718, 18409, Hybrid; UltraFeedback-Mistral: 56904, 1866, Hybrid; UltraFeedback-Llama3: 58119, 1906, Hybrid; UltraFeedback-Llama3-v0.2: 59876, 1961, Hybrid; UltraFeedback-Gemma-v0.2: 59569, 1941, Hybrid. |
| Hardware Specification | Yes | All the training experiments in this paper were conducted on 8 A100 GPUs. |
| Software Dependencies | No | The paper mentions software like DPO, Pythia-2.8B, Mistral-7B-Instruct-v0.2, Meta-Llama-3-8B-Instruct, llm-blender/PairRM, RLHFlow/ArmoRM-Llama3-8B-v0.1, and google/gemma-2-9b-it, but does not provide specific version numbers for any programming languages or libraries used for implementation. |
| Experiment Setup | Yes | In this section, we present the primary experimental results along with their analysis. For SamS, both the exploitation and exploration modules are implemented as 16-layer residual MLPs. We set the batch size \|X_t\| to 64 and the selection size \|X̃_t\| to 32 across all training rounds. Additional implementation details of SamS are provided in Appendix D due to space constraints. D.1 Experimental Setup. Scheduler Settings. For the encoder layer of f, we initialize it with all-MiniLM-L6-v2. To improve training efficiency, we pretrain the encoder layer offline and freeze its weights during the preference optimization process. The specific training details are provided in Appendix F.3. For the Exploitation Network f S, we set its width m = 4096 and depth L = 16. As described in Section 4, we first concatenate the hidden states of f S. Then, we perform downsampling using a parameter of 4, which entails calculating the average of every four consecutive positions. For the Exploration Network f S, we also set its depth L = 16. Its width is jointly determined by the depth of f S and the downsampling parameter. For Scheduler Training, we sample 32 offline batches from the random sample pool P at each round t, which has a capacity of 40,000. We use the Adam optimizer for both f S and f S, and set the initial learning rate to 10^-4. For Schedule Selection, we set the scheduling budget \|X̃_t\| = 1 Baselines. Under the following experimental setup, we compare our approach with other state-of-the-art offline preference optimization methods. Among these, RRHF [93] and SLiC-HF [96] both utilize ranking losses. RRHF employs a length-normalized log-likelihood function, whereas SLiC-HF [96] directly uses the log-likelihood function and incorporates an SFT objective. IPO [5] is a theoretically grounded method that avoids DPO's assumption that pairwise preferences can be substituted with pointwise rewards. CPO [89] uses sequence likelihood as a reward and trains along the SFT objective. KTO [32] learns from non-paired preference data. ORPO [42] introduces a reference-model-free odds ratio term to directly contrast winning and losing responses with the policy model and jointly trains with the SFT objective. R-DPO [65] is an enhanced version of DPO that incorporates an additional regularization term to mitigate length exploitation. Preference Dataset Generation. To ensure fairness in comparisons, we adopt experimental settings that are currently widely used [60, 85, 42]. We utilize widely adopted instruction-tuned models as SFT models and employ the SFT model to generate five responses for each prompt x in the UltraFeedback dataset [22]. Subsequently, a pretrained reward model serves as the annotator to directly assign a reward score r(x, y_i) to each candidate response y_i. We then select the two responses with the largest score difference, y_w = y_argmax(r) and y_l = y_argmin(r), to form a sample (x, y_w, y_l) in the preference dataset D. LLM Settings. We conduct experiments using two model settings. The first model setting employs mistralai/Mistral-7B-Instruct-v0.2 [46] and meta-llama/Meta-Llama-3-8B-Instruct [1] as SFT models, with llm-blender/PairRM [47] serving as the reward model. The second model setting, which we refer to as v0.2, employs meta-llama/Meta-Llama-3-8B-Instruct [1] and google/gemma-2-9b-it [77] as SFT models. We utilize the more powerful RLHFlow/ArmoRM-Llama3-8B-v0.1 [82] as the reward model. Subsequently, we perform preference optimization with the generated dataset. Hyperparameters. We set the sampling temperature to 0.8 when generating responses with the SFT model. For DPO, we set β = 0.01, with a learning rate of 5 × 10^-7 for Mistral-7B-Instruct-v0.2, 1 × 10^-6 for Meta-Llama-3-8B-Instruct, and 3 × 10^-7 for gemma-2-9b-it. Evaluation Settings. We primarily evaluate our models using two widely adopted open-ended instruction-following benchmarks: MT-Bench [100] and AlpacaEval 2 [29]. These benchmarks assess the models' general conversational capabilities across diverse query sets, with specific configurations detailed in Table 4. All the training experiments in this paper were conducted on 8 A100 GPUs. |
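
Two steps quoted in the Experiment Setup row are simple enough to illustrate in code: forming a preference sample from the highest- and lowest-reward responses, and downsampling by averaging every four consecutive positions. The following is a minimal sketch, not the authors' implementation. The function names `build_preference_pair` and `average_downsample`, the toy reward values, and the choice to reject inputs whose length is not a multiple of the downsampling factor are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def build_preference_pair(
    prompt: str,
    responses: Sequence[str],
    rewards: Sequence[float],
) -> Tuple[str, str, str]:
    """Return (x, y_w, y_l): the prompt with its highest- and lowest-reward responses.

    Hypothetical helper: `rewards[i]` stands in for the reward model score r(x, y_i).
    """
    if len(responses) != len(rewards) or len(responses) < 2:
        raise ValueError("need at least two responses, each with one reward")
    indices = range(len(rewards))
    best = max(indices, key=lambda i: rewards[i])   # argmax of r
    worst = min(indices, key=lambda i: rewards[i])  # argmin of r
    return prompt, responses[best], responses[worst]


def average_downsample(values: Sequence[float], factor: int = 4) -> List[float]:
    """Average every `factor` consecutive entries of `values`.

    Assumption: the length must be a multiple of `factor`; the quoted text
    does not say how a remainder would be handled.
    """
    if factor <= 0 or len(values) % factor != 0:
        raise ValueError("length of values must be a positive multiple of factor")
    return [
        sum(values[start:start + factor]) / factor
        for start in range(0, len(values), factor)
    ]


if __name__ == "__main__":
    # Five candidate responses per prompt, as in the quoted setup; scores are made up.
    candidates = ["a", "b", "c", "d", "e"]
    scores = [0.1, 0.9, -0.3, 0.5, 0.2]
    print(build_preference_pair("example prompt", candidates, scores))
    print(average_downsample([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]))
```

In a real pipeline the scores would come from the reward model named in the row and the downsampled vector would be built from concatenated hidden states; both are replaced here by plain Python lists so the sketch runs without dependencies.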