Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Leveraging robust optimization for LLM alignment under distribution shifts

Authors: Mingye Zhu, Yi Liu, Zheren Fu, Yongdong Zhang, Zhendong Mao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments. Models and Datasets. We validate the proposed method with two base models: Mistral-7B-v0.1, Llama-3.1-8B, on three widely used datasets in alignment literature: HH-RLHF, Summarization and the UltraFeedback datasets. 4.1 Experimental Results. 4.2 Ablations. 4.3 Generalization and Robustness Evaluation.
Researcher Affiliation	Collaboration	1 University of Science and Technology of China, Hefei, China; 2 State Key Laboratory of Communication Content Cognition, People's Daily Online, Beijing, China
Pseudocode	Yes	Algorithm 1: DoRA Optimization Algorithm
Open Source Code	No	Answer: [No] Justification: We are finalizing the codebase and will release soon once we finished.
Open Datasets	Yes	Models and Datasets. We validate the proposed method with two base models: Mistral-7B-v0.1, Llama-3.1-8B, on three widely used datasets in alignment literature: HH-RLHF, Summarization and the UltraFeedback datasets. All the datasets are subject to the terms of the MIT License, except for the AlpacaEval benchmark which is subject to the Apache-2.0 license.
Dataset Splits	Yes	For HH and Summarization tasks, we adapt the evaluation prompts from Rafailov et al. [35] using GPT-4o, and compute the win rates and lose rates with 400 randomly selected test queries, with the order randomly swapped. 1. pairwise preference setting, where we leverage the original pairwise data; 2. listwise preference setting, where we augment the original pairwise data with 2 additional synthetic responses from Mistral-7B-Instruct-v0.3, leading to 4 responses in total for each query.
Hardware Specification	Yes	All experiments are run on 80GB A100 GPUs with a batch size of 32.
Software Dependencies	No	We begin by performing SFT on the selected responses for each task, following the default hyperparameter configurations from the DPO codebase. All experiments are run on 80GB A100 GPUs with a batch size of 32.
Experiment Setup	Yes	Detailed hyperparameter configurations and additional training settings are provided in Appendix C. General training settings. We begin by performing SFT on the selected responses for each task, following the default hyperparameter configurations from the DPO codebase. All experiments are run on 80GB A100 GPUs with a batch size of 32. For pairwise training, we adopt a learning rate of 3e-7 for the Mistral-7B base model and 6e-7 for the Llama-8B base model, in line with Meng et al. [31]. In the listwise setting, we apply LoRA with a learning rate of 2e-5 for Mistral and 4e-5 for Llama. For UltraFeedback, we default to full fine-tuning. For classifier training, we set the learning rate to 2e-5 and train for 3 epochs. Table 8: Hyperparameters for different baselines and tasks. Decoding hyperparameters. We adopt a fixed sampling strategy across all experiments to ensure consistency in response generation. Specifically, we set the temperature to 0.8, top-k to 50, and top-p to 0.9 during sampling. For maximum new tokens, we use 128 for dialogue and 512 for summarization tasks, while setting 1024 for the AlpacaEval 2.0 benchmark and 4096 for Arena-Hard bench.
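
For readers attempting a reproduction, the settings quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code (the Open Source Code row reports that none has been released); every name below (`LEARNING_RATES`, `MAX_NEW_TOKENS`, `DecodingConfig`, `decoding_config`, and the task keys) is hypothetical, and only the numeric values come from the quoted text.

```python
from dataclasses import dataclass

# Batch size quoted for all experiments.
BATCH_SIZE = 32

# Learning rates quoted per (preference setting, base model).
# Keys are hypothetical labels, not identifiers from the paper.
LEARNING_RATES = {
    ("pairwise", "mistral"): 3e-7,
    ("pairwise", "llama"): 6e-7,
    ("listwise", "mistral"): 2e-5,  # listwise setting uses LoRA
    ("listwise", "llama"): 4e-5,    # listwise setting uses LoRA
}

# Maximum new tokens quoted per evaluation task (hypothetical task keys).
MAX_NEW_TOKENS = {
    "dialogue": 128,
    "summarization": 512,
    "alpaca_eval_2": 1024,
    "arena_hard": 4096,
}


@dataclass(frozen=True)
class DecodingConfig:
    """Fixed sampling settings quoted for all experiments."""
    temperature: float = 0.8
    top_k: int = 50
    top_p: float = 0.9
    max_new_tokens: int = 128


def decoding_config(task: str) -> DecodingConfig:
    """Return the sampling settings for a task; only max_new_tokens varies."""
    return DecodingConfig(max_new_tokens=MAX_NEW_TOKENS[task])
```

For example, `decoding_config("summarization")` keeps the fixed temperature, top-k, and top-p values and sets `max_new_tokens` to 512. Settings the row defers to Appendix C and Table 8 of the paper are deliberately left out.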