Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling

Authors: Nguyen Phuc, Ngoc-Hieu Nguyen, Duy M. H. Nguyen, Anji Liu, An Mai, Thanh Binh Nguyen, Daniel Sonntag, Khoa D Doan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we empirically evaluate IS-DAAs' ability to align language models with human preferences and mitigate the reward over-optimization problem. First, in the TL;DR Summarization task, we systematically study the trade-off between the policy performance and KL regularization achieved by different alignment methods in a controlled environment where we assume to have access to a golden reward model as the ground-truth preferences. Next, in the Instruction Following benchmark, we evaluate IS-DAAs on three standard open-ended instruction following benchmarks.
Researcher Affiliation Academia Phuc Minh Nguyen¹, Ngoc-Hieu Nguyen¹, Duy H. M. Nguyen³,⁷,⁹, Anji Liu³,⁴, An Mai⁵, Binh T. Nguyen⁶, Daniel Sonntag⁷,⁸, Khoa D. Doan¹,². ¹College of Engineering and Computer Science, VinUniversity; ²VinUni-Illinois Smart Health Center, VinUniversity; ³University of Stuttgart; ⁴National University of Singapore; ⁵International University, VNUHCM; ⁶University of Science, VNUHCM; ⁷German Research Center for Artificial Intelligence (DFKI); ⁸Oldenburg University; ⁹Max Planck Research School for Intelligent Systems (IMPRS-IS)
Pseudocode No The paper describes methods and derivations mathematically but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code Yes Our code is available at https://github.com/mail-research/AIS-Sampling4DAAs.
Open Datasets Yes For the summarization task, we consider the filtered version of the Reddit TL;DR summarization dataset [Stiennon et al., 2020]. For the instruction following task, we consider the Anthropic Helpful and Harmless (HH) dataset [Bai et al., 2022] and the UltraFeedback dataset [] for preference training.
Dataset Splits No SFT Training: For the summarization task, we use the SFT split of Reddit TL;DR summarization. For Anthropic-HH, we use the chosen responses from the preference dataset for the SFT stage. We pool together both datasets into a single SFT dataset. For the TL;DR summarization dataset, we train all methods for 2 epochs. We sample 2 completions per prompt from the learned policy with 512 prompts from the evaluation set. For GPT-4 evaluation, we sample 256 prompts from the evaluation set.
Hardware Specification Yes We train and evaluate our models using NVIDIA 4x H100 GPUs. We use 4 A100-80GB GPUs, a batch size of 16, and LLaMA-3.2-3B as the base model.
Software Dependencies No The paper mentions "Llama-3.2-3B [Meta AI, 2024a,b]" as the pre-trained base model, but does not specify other software libraries or their version numbers, such as Python or PyTorch versions.
Experiment Setup Yes Across all SFT and preference training, we use a global batch size of 64 (with 4 gradient accumulation steps), and the AdamW optimizer with a learning rate of 1 × 10⁻⁶ (cosine learning rate scheduler with warmup for 100 steps) and a max length of 640. For the TL;DR summarization dataset, we train all methods for 2 epochs. To evaluate the efficiency of addressing the over-optimization problem, we vary the regularization strength β ∈ {0.01, 0.05, 0.1}. For Anthropic-HH, we consider 1 epoch of training with β = 0.05 as standard configurations for offline alignments [Rafailov et al., 2024, Gao et al., 2024, Guo et al., 2024].
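
The learning-rate schedule quoted in the Experiment Setup row can be sketched as below. This is only an illustration under stated assumptions, not the authors' training code: the quoted text gives the peak rate (1 × 10⁻⁶) and the 100 warmup steps, while the linear warmup shape, the decay to zero, the function name `lr_at`, and the `total_steps` value used in the usage lines are hypothetical.

```python
import math

PEAK_LR = 1e-6      # learning rate quoted in the Experiment Setup row
WARMUP_STEPS = 100  # warmup length quoted in the Experiment Setup row


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at a 0-indexed optimizer step.

    Assumptions (not stated in the source): warmup is linear and the
    cosine phase decays to zero at `total_steps`.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length; the source does not report the number of steps.
total_steps = 1000
schedule = [lr_at(s, total_steps) for s in range(total_steps + 1)]
```

The schedule rises during the first 100 steps, peaks at the quoted rate, and then decreases monotonically; a different warmup shape or a nonzero final rate would change only the two return expressions.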