Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scalable Valuation of Human Feedback through Provably Robust Model Alignment
Authors: Masahiro Fujisawa, Masaki Adachi, Michael A Osborne
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, applied to Anthropic HH-RLHF dataset, it reveals substantial noise levels and removing these mislabels significantly improves alignment performance across methods. |
| Researcher Affiliation | Collaboration | Masahiro Fujisawa (1, 2, 4), Masaki Adachi (2, 3), Michael A. Osborne (3). (1) The University of Osaka, (2) Lattice Lab, Toyota Motor Corporation, (3) Machine Learning Research Group, University of Oxford, (4) RIKEN AIP |
| Pseudocode | Yes | Figure 6: Pseudocode for Hölder-DPO and DPO objectives |
| Open Source Code | Yes | The code is available at https://github.com/ma921/HolderDPO (given as footnote 1 in the paper). |
| Open Datasets | Yes | We evaluate alignment robustness on a sentiment-controlled text generation task using the IMDb dataset [71]... For dataset, we evaluate on the Golden HH [15], a manually curated version of Anthropic HH [34]... We further evaluated Hölder-DPO on larger language models Mistral-8B [93] and NeMo-12B [94] both capable of multilingual interaction. Experiments were conducted on the OASST1 dataset [89, 59]... |
| Dataset Splits | Yes | To ensure clean dataset, we filter out preference pairs that do not satisfy $r(y_{\text{win}}, x) - r(y_{\text{lose}}, x) > 0.1$, yielding 12,000 clean pairs. Of these, 10,000 are used for training and 2,000 for evaluation... The dataset contains 42,500 training and 2,310 test examples. |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU (40 GB VRAM, 83.48 GB RAM). |
| Software Dependencies | No | We implement all methods using the Transformers [96, 104], TRL [98], and PyTorch [80] libraries. |
| Experiment Setup | Yes | We set β = 0.1 and γ = 2.0, and all other hyperparameters follow the TRL defaults. Complete implementation details, including Hugging Face URLs to models and datasets used, are provided in Appendix H. ... Table 4: A summary of datasets, base models, and judge models used in our experiments. Prompts: max token length 512; temperature 0.25; top k 50; top p 0.95; repetition penalty 1.3; no repeat ngram size 4. SFT: epoch 1; batch size 4; gradient accumulation 8; effective batch size 32; learning rate 5e-7; fp16. DPO: epoch 3; batch size 4; gradient accumulation 8; effective batch size 32; learning rate 1e-6; fp16; DPO beta 0.1; optimizer AdamW. PEFT: epoch 3; quant type nf4; dtype bfloat16; lora alpha 16; lora dropout 0.1; r 32; target modules all linear; optimizer Adam8bit; learning rate 1e-6. |
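
The Dataset Splits row describes a reward-margin filter followed by a 10,000 / 2,000 split. The sketch below illustrates one way such a step could look; it is not taken from the authors' repository. The dictionary keys `r_win` and `r_lose`, the function name `filter_and_split`, and the seeded shuffle before splitting are all assumptions made for illustration, since the quoted text does not say how the 12,000 clean pairs were assigned to each split.

```python
import random


def filter_and_split(pairs, margin=0.1, n_train=10_000, n_eval=2_000, seed=0):
    """Keep pairs whose reward gap exceeds `margin`, then split them.

    Each pair is a dict with the hypothetical keys 'r_win' and 'r_lose',
    holding reward scores r(y_win, x) and r(y_lose, x) for one prompt x.
    """
    # Filter: retain only pairs with r(y_win, x) - r(y_lose, x) > margin.
    clean = [p for p in pairs if p["r_win"] - p["r_lose"] > margin]

    if len(clean) < n_train + n_eval:
        raise ValueError(
            f"only {len(clean)} clean pairs, need {n_train + n_eval}"
        )

    # Assumption: shuffle with a fixed seed so the split is repeatable.
    rng = random.Random(seed)
    rng.shuffle(clean)

    train = clean[:n_train]
    held_out = clean[n_train:n_train + n_eval]
    return train, held_out


# Toy usage with four pairs; only the first and last pass the 0.1 margin.
toy = [
    {"r_win": 0.9, "r_lose": 0.1},
    {"r_win": 0.55, "r_lose": 0.5},
    {"r_win": 0.3, "r_lose": 0.6},
    {"r_win": 0.8, "r_lose": 0.2},
]
train, held_out = filter_and_split(toy, n_train=1, n_eval=1)
```

With the defaults, the function expects at least 12,000 pairs to survive the filter, matching the counts quoted in the table; the toy call overrides the split sizes so it runs on four examples.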