Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Can DPO Learn Diverse Human Values? A Theoretical Scaling Law

Authors: Shawn Im, Sharon Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper introduces a new theoretical framework to analyze how generalization scales with value diversity and sample quantity in models trained with direct preference optimization. Our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we provide a bound on the generalization error that demonstrates the challenges of effectively learning a wide set of concepts or values. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.
Researcher Affiliation | Academia | Shawn Im, Sharon Li; Department of Computer Sciences, University of Wisconsin-Madison; EMAIL
Pseudocode | No | The paper does not contain explicit pseudocode or algorithm blocks. The methods are described through mathematical formulations and textual descriptions.
Open Source Code | Yes | 5. Open access to data and code. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes]. Justification: We have open-sourced the code here.
Open Datasets | Yes | To ground our theoretical analysis, we begin with a concrete example from Anthropic's persona dataset [40], which encompasses diverse types of human values.
Dataset Splits | Yes | For each persona, we randomly sample a subset of 90% of the statements for training, and use the remaining 10% for testing.
Hardware Specification | Yes | Software and hardware. We train with 4 A100 80GB GPUs using the TRL library [104] and Huggingface library [105] for full fine-tuning, generate embeddings with the Huggingface library and 1 A100 80GB GPU, and perform last-layer training on 1 A100 80GB GPU.
Software Dependencies | No | The paper mentions 'TRL library [104]' and 'Huggingface library [105]' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | Training setup. For all full fine-tuning training runs, we use the AdamW optimizer with a learning rate of 10^-5 for Llama models and 10^-6.5 for Qwen and Mistral, with no warm-up steps and a constant learning rate. We train on 4 GPUs with a batch size of 32 per device. For last-layer training runs, we use the Adam optimizer with a learning rate of 1e-3. For all experiments, we use β = 0.01.
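
The Dataset Splits and Experiment Setup entries can be illustrated with a short Python sketch. This is not the authors' code: the names `split_persona`, `SETUP`, `effective_batch_size`, and `dpo_loss`, as well as the random seed, are hypothetical and introduced only for illustration. The 90/10 ratio, the GPU count, the per-device batch size, the learning rates, and β are taken from the entries above. The loss shown is the standard DPO objective, -log σ(β · margin), and the effective batch size assumes no gradient accumulation, which the report does not mention.

```python
import math
import random


def split_persona(statements, train_frac=0.9, seed=0):
    """Randomly split one persona's statements into train and test lists.

    The seed is an assumption; the report does not state one.
    """
    rng = random.Random(seed)
    shuffled = list(statements)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


# Hyperparameters as quoted in the Experiment Setup entry.
SETUP = {
    "num_gpus": 4,
    "per_device_batch_size": 32,
    "lr_llama": 10 ** -5,           # full fine-tuning, Llama models
    "lr_qwen_mistral": 10 ** -6.5,  # full fine-tuning, Qwen and Mistral
    "lr_last_layer": 1e-3,          # last-layer training with Adam
    "beta": 0.01,                   # DPO temperature
}


def effective_batch_size(setup):
    """Samples per optimizer step, assuming no gradient accumulation."""
    return setup["num_gpus"] * setup["per_device_batch_size"]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l are the policy log-probabilities of the preferred and
    rejected responses; ref_* are the same under the reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    train, test = split_persona([f"statement {i}" for i in range(100)])
    print(len(train), len(test))
    print(effective_batch_size(SETUP))
    print(SETUP["lr_qwen_mistral"])
```

Under these assumptions, 4 GPUs with 32 samples per device give 128 samples per optimizer step, and a learning rate of 10^-6.5 is roughly 3.16e-7.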