Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Authors: Masahiro Fujisawa, Masaki Adachi, Michael A Osborne

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, applied to Anthropic HH-RLHF dataset, it reveals substantial noise levels and removing these mislabels significantly improves alignment performance across methods.
Researcher Affiliation | Collaboration | Masahiro Fujisawa (1, 2, 4), Masaki Adachi (2, 3), Michael A. Osborne (3). Affiliations: 1 The University of Osaka; 2 Lattice Lab, Toyota Motor Corporation; 3 Machine Learning Research Group, University of Oxford; 4 RIKEN AIP
Pseudocode | Yes | Figure 6: Pseudocode for Hölder-DPO and DPO objectives
Open Source Code | Yes | The code is available (footnote 1). Footnote 1: https://github.com/ma921/HolderDPO
Open Datasets | Yes | We evaluate alignment robustness on a sentiment-controlled text generation task using the IMDb dataset [71]... For dataset, we evaluate on the Golden HH [15], a manually curated version of Anthropic HH [34]... We further evaluated Hölder-DPO on larger language models Mistral-8B [93] and NeMo-12B [94] both capable of multilingual interaction. Experiments were conducted on the OASST1 dataset [89, 59]...
Dataset Splits | Yes | To ensure clean dataset, we filter out preference pairs that do not satisfy r(y_win, x) - r(y_lose, x) > 0.1, yielding 12,000 clean pairs. Of these, 10,000 are used for training and 2,000 for evaluation... The dataset contains 42,500 training and 2,310 test examples.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU (40 GB VRAM, 83.48 GB RAM).
Software Dependencies | No | We implement all methods using the Transformers [96, 104], TRL [98], and PyTorch [80] libraries.
Experiment Setup | Yes | We set β = 0.1 and γ = 2.0, and all other hyperparameters follow the TRL defaults. Complete implementation details, including Hugging Face URLs to models and datasets used, are provided in Appendix H. ... Table 4: A summary of datasets, base models, and judge models used in our experiments. Hyperparameters: prompts (max token length 512, temperature 0.25, top k 50, top p 0.95, repetition penalty 1.3, no repeat ngram size 4); SFT (epoch 1, batch size 4, gradient accumulation 8, effective batch size 32, learning rate 5e-7, fp16); DPO (epoch 3, batch size 4, gradient accumulation 8, effective batch size 32, learning rate 1e-6, fp16, DPO beta 0.1, optimizer AdamW); PEFT (epoch 3, quant type nf4, dtype bfloat16, lora alpha 16, lora dropout 0.1, r 32, target modules all linear, optimizer Adam8bit, learning rate 1e-6)
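The filtering rule quoted in the Dataset Splits row and the batch-size figures in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' code: the dictionary keys `r_win` and `r_lose`, the function names, the shuffling step, and the seed are all hypothetical choices made for illustration. Only the 0.1 margin, the 10,000/2,000 split, and the batch size 4 with gradient accumulation 8 come from the quoted text.

```python
import random


def filter_and_split(pairs, margin=0.1, n_train=10_000, n_eval=2_000, seed=0):
    """Keep pairs whose reward gap exceeds `margin`, then split train/eval.

    `pairs` is a list of dicts with hypothetical keys 'r_win' and 'r_lose'
    holding the reward of the preferred and rejected response.
    """
    # Keep only pairs with r(y_win, x) - r(y_lose, x) > margin.
    clean = [p for p in pairs if p["r_win"] - p["r_lose"] > margin]

    # Assumption: shuffle before splitting; the source does not say how
    # the 10,000 / 2,000 partition was drawn.
    rng = random.Random(seed)
    rng.shuffle(clean)

    train = clean[:n_train]
    evaluation = clean[n_train:n_train + n_eval]
    return train, evaluation


def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Effective batch size = per-device batch * accumulation steps * devices."""
    return per_device_batch * grad_accum_steps * n_devices


if __name__ == "__main__":
    # Batch size 4 with gradient accumulation 8 on one GPU gives 32,
    # matching the 'effective batch size 32' entries in the table.
    print(effective_batch_size(4, 8))
```

With one GPU, as in the Hardware Specification row, the effective batch size is the per-device batch size times the number of accumulation steps, which is how 4 and 8 combine to the reported 32.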