Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Diverse Preference Learning for Capabilities and Alignment
Authors: Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we consider four experimental settings for evaluating our algorithm. First, we show that in general-purpose chat domains, SPL allows for increased diversity with less performance degradation than DPO with token-level temperature scaling. Second, we consider an application of high-temperature generation in best-of-N problem-solving settings. Finally, we evaluate SPL's logit calibration, finding reduced overconfidence and improved calibration on standard multiple-choice benchmarks. Figure 2: Improved diversity-quality tradeoffs with SPL. |
| Researcher Affiliation | Academia | Stewart Slocum, Asher Parker-Sartori, and Dylan Hadfield-Menell, MIT CSAIL |
| Pseudocode | No | The paper describes its methodology through mathematical formulations and theoretical analysis in Section 3 and Appendix A, but does not include any explicitly labeled pseudocode or algorithm blocks with structured, step-by-step procedures. |
| Open Source Code | No | The paper does not explicitly state that source code is provided, nor does it include any links to a code repository. |
| Open Datasets | Yes | We train on the HH-RLHF preference dataset for 5,000 steps (details in Appendix C.4) (Bai et al., 2022). For these experiments, we LoRA finetune Mistral-7B-Instruct-v0.2 with DPO and SPL (Rafailov et al., 2024; Hu et al., 2021). We apply DPO and SPL to a Mistral-7B base model (Hugging Face, 2023) trained with supervised fine-tuning on the UltraChat dataset (Ding et al., 2023). This approach trains on the Ultrafeedback-200k dataset, a large preference dataset covering a broad suite of chat and reasoning tasks (Cui et al., 2024). We evaluate against two mathematical reasoning datasets: the GSM8K grade-school math dataset (Cobbe et al., 2021) and the more challenging MATH dataset (Hendrycks et al., 2021b). We evaluate against two standard multiple-choice datasets: TruthfulQA and MMLU. TruthfulQA is a benchmark designed to assess a model's ability to provide truthful answers in contexts where misconceptions are prevalent (Lin et al., 2022). MMLU tests a model's knowledge and reasoning across 57 diverse subjects, from philosophy to abstract algebra (Hendrycks et al., 2021a). |
| Dataset Splits | Yes | At inference time, we sample 500 prompts from a held-out validation split of HH-RLHF that neither the language models nor the reward model were trained on. We sample 128 completions on a random split of 200 problems from each dataset. We also divide problems into Easy, Medium, and Hard categories. For MATH, this corresponds to Level 1, Level 3, and Level 5 problems. For GSM8K, we run our evaluation on Mistral-Instruct-7B and group problems as easy if they take 4 or fewer samples to solve, medium if they take 5-64 samples to solve, and hard if they take more than 64 samples to solve. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions several models and tools like Mistral-7B-Instruct-v0.2, Sentence-BERT-Large, OpenAI's text-embedding-3-small, and gpt-4o-mini-2024-07-18, but does not provide specific versions for ancillary software libraries or frameworks (e.g., Python, PyTorch, Transformers). |
| Experiment Setup | Yes | For both DPO and SPL, we LoRA finetune Mistral-7B-Instruct-v0.2 on HH-RLHF for 5,000 steps with batch size 8. We use LoRA rank r_LoRA = 16, regularization α_LoRA = 16, and dropout p_LoRA = 0.05. We use learning rate 1e-5, 150 warmup steps, and max conversation length of 512 tokens. For all runs, we use regularization parameter β = 0.1. For both DPO and SPL, our base model is a Mistral-7B base model that has been full-parameter supervised fine-tuned on the UltraChat dataset. We then LoRA-finetune this model on Ultrafeedback-200k for one epoch. We use batch size 4, LoRA rank r_LoRA = 64, regularization α_LoRA = 64, and dropout p_LoRA = 0.05. We use learning rate 1e-5, 150 warmup steps, and max conversation length of 1024 tokens. |
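
The two fine-tuning configurations in the Experiment Setup row, and the GSM8K difficulty grouping in the Dataset Splits row, can be written down compactly. The following is a minimal sketch in plain Python, not the authors' code (the paper releases none). The class name `FinetuneSetup`, its field names, and the function `gsm8k_difficulty` are hypothetical. It assumes the extracted learning rate "1e 5" means 1e-5 and that "for all runs" extends β = 0.1 to the Ultrafeedback-200k setup as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneSetup:
    """Reported hyperparameters for one LoRA fine-tuning setup.

    Field names are illustrative; values come from the Experiment Setup row.
    """
    base_model: str
    dataset: str
    schedule: str            # training length as reported in the row
    batch_size: int
    lora_rank: int
    lora_alpha: int
    max_length: int          # max conversation length in tokens
    lora_dropout: float = 0.05
    learning_rate: float = 1e-5   # assumption: "1e 5" in the row means 1e-5
    warmup_steps: int = 150
    beta: float = 0.1             # assumption: "for all runs" covers both setups


HH_RLHF = FinetuneSetup(
    base_model="Mistral-7B-Instruct-v0.2",
    dataset="HH-RLHF",
    schedule="5000 steps",
    batch_size=8,
    lora_rank=16,
    lora_alpha=16,
    max_length=512,
)

ULTRAFEEDBACK = FinetuneSetup(
    base_model="Mistral-7B base, SFT on UltraChat",
    dataset="Ultrafeedback-200k",
    schedule="1 epoch",
    batch_size=4,
    lora_rank=64,
    lora_alpha=64,
    max_length=1024,
)


def gsm8k_difficulty(samples_to_solve: int) -> str:
    """Group a GSM8K problem by how many samples were needed to solve it.

    Thresholds follow the Dataset Splits row: 4 or fewer is easy,
    5-64 is medium, more than 64 is hard.
    """
    if samples_to_solve < 1:
        raise ValueError("samples_to_solve must be at least 1")
    if samples_to_solve <= 4:
        return "easy"
    if samples_to_solve <= 64:
        return "medium"
    return "hard"
```

The row does not say how problems left unsolved after all 128 samples are grouped, so the sketch leaves that case to the caller.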