Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Importance Weighting for Aligning Language Models under Deployment Distribution Shift
Authors: Thanawat Lodkaew, Tongtong Fang, Takashi Ishida, Masashi Sugiyama
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on various distribution shift scenarios demonstrate the usefulness of IW-DPO. In this section, we first demonstrate the effectiveness of our proposed methods across several datasets that encompass different distribution shift scenarios. Additionally, we compare our methods against WPO (Zhou et al., 2024). Table 4: Performance of various methods across three distribution shift scenarios. |
| Researcher Affiliation | Academia | Thanawat Lodkaew EMAIL The University of Tokyo, Japan Tongtong Fang EMAIL The Institute of Statistical Mathematics, Japan Takashi Ishida EMAIL RIKEN, Japan The University of Tokyo, Japan Masashi Sugiyama EMAIL RIKEN, Japan The University of Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 IW-DPO: 1: Finish warmup phase. 2: Define t as ℓ_DPO (for IW-DPO-L) or r̂ (for IW-DPO-R). 3: Define the batch sizes N_Btr and N_Bv. 4: Define the number of training epochs E. 5: for e = 1 to E do 6: for batch B_tr = {(x^{tr,i}, y_1^{tr,i}, y_2^{tr,i}, b^{tr,i})}_{i=1}^{N_Btr} drawn i.i.d. from D_tr do 7: Sample batch B_v = {(x^{v,i}, y_1^{v,i}, y_2^{v,i}, b^{v,i})}_{i=1}^{N_Bv} i.i.d. from D_v. 8: Obtain Z_tr with respect to B_tr and Z_v with respect to B_v. 9: Estimate w with Z_tr and Z_v as inputs. 10: Obtain ŵ by normalizing w. 11: Obtain per-instance losses [ℓ_DPO^{tr,1}, ..., ℓ_DPO^{tr,N_Btr}]. 12: Obtain Ĵ by reweighting the per-instance losses with ŵ. 13: Compute the gradients with Ĵ. 14: Update the model parameters using the computed gradients. 15: end for 16: end for |
| Open Source Code | No | The text does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. The paper mentions being "Reviewed on Open Review: https://openreview.net/forum?id=C7QWN4AXvp" but this is for peer review, not code release. |
| Open Datasets | Yes | We employ the Safe RLHF dataset, where each instance contains a question and a pair of responses. In addition to preference labels based on helpfulness, the Safe RLHF dataset (Dai et al., 2024; Ji et al., 2023) includes a safety label for each response... The SHP dataset (Ethayarajh et al., 2022) consists of questions and responses from 18 different domains... The CALI dataset (Huang & Yang, 2023) contains premises, hypotheses, and labels... The URLs are https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF for Safe RLHF, https://huggingface.co/datasets/stanfordnlp/SHP for SHP, ... for CALI |
| Dataset Splits | Yes | We further divide the Helpful-Harmless set into three sets: Helpful-Harmless training set, Helpful-Harmless validation set, and Helpful-Harmless test set. We then create the training dataset Dtr by combining the Helpful-Harmful set and the Helpful-Harmless training set. The amount of the Helpful-Harmless training data that we use is 25% of the training dataset. While the Helpful-Harmless validation set is used as the validation dataset Dv, the Helpful-Harmless test set is used as Dte for evaluation. Dv is fifty times smaller than Dtr. Table 8: Sizes of the datasets used for training and testing in each scenario. |
| Hardware Specification | No | The paper discusses various language models (e.g., Llama 3.1-8B-Instruct, Pythia-1.4B, Gemma 2-9B, Pythia-2.8B, Gemma 2-2B) used in experiments, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or memory) used to conduct these experiments. |
| Software Dependencies | No | The paper does not explicitly state any specific software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | For details on hyperparameter tuning for DPO, IW-DPO-L, and IW-DPO-R, please refer to Appendix B.1. See Appendix B.2 for the number of instances for the training, validation, and test sets. Table 7: Default hyperparameter settings: β (for Eq. (5)): 0.1 for DPO, IW-DPO-L, and IW-DPO-R; λ (for Eq. (9)): 0.1 for IW-DPO-L and IW-DPO-R; γ (for RBF): 0.1 for IW-DPO-L and IW-DPO-R; warmup_examples: 1024 for IW-DPO-L and IW-DPO-R. |
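The inner loop of the Algorithm 1 excerpt quoted above (estimate importance weights from training and validation features, normalize, reweight the per-instance DPO losses, update) can be sketched as a single training step. This is a minimal sketch, not the authors' released implementation: `featurize`, `estimate_weights`, and `per_instance_dpo_loss` are hypothetical stand-ins for the paper's feature extractor, weight estimator (e.g., the RBF-kernel-based estimator the γ hyperparameter suggests), and per-instance DPO loss.

```python
import torch


def importance_weighted_step(model, optimizer, train_batch, val_batch,
                             featurize, estimate_weights,
                             per_instance_dpo_loss):
    """One IW-DPO-style training step, sketched after Algorithm 1.

    All three callables are assumptions standing in for components the
    paper defines elsewhere; only the reweighting structure is taken
    from the quoted pseudocode.
    """
    z_tr = featurize(train_batch)            # Z_tr for the training batch
    z_v = featurize(val_batch)               # Z_v for the validation batch
    w = estimate_weights(z_tr, z_v)          # importance weights w
    w_hat = w / w.sum()                      # normalized weights w-hat
    losses = per_instance_dpo_loss(model, train_batch)  # per-instance DPO losses
    j_hat = (w_hat * losses).sum()           # reweighted objective J-hat
    optimizer.zero_grad()
    j_hat.backward()                         # gradients of J-hat
    optimizer.step()                         # parameter update
    return j_hat.item()
```

The normalization step (line 10 of the pseudocode) keeps the effective learning rate comparable across batches regardless of the raw weight scale.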