Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Importance Weighting for Aligning Language Models under Deployment Distribution Shift
Authors: Thanawat Lodkaew, Tongtong Fang, Takashi Ishida, Masashi Sugiyama
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on various distribution shift scenarios demonstrate the usefulness of IW-DPO. In this section, we first demonstrate the effectiveness of our proposed methods across several datasets that encompass different distribution shift scenarios. Additionally, we compare our methods against WPO (Zhou et al., 2024). Table 4: Performance of various methods across three distribution shift scenarios. |
| Researcher Affiliation | Academia | Thanawat Lodkaew EMAIL The University of Tokyo, Japan Tongtong Fang EMAIL The Institute of Statistical Mathematics, Japan Takashi Ishida EMAIL RIKEN, Japan The University of Tokyo, Japan Masashi Sugiyama EMAIL RIKEN, Japan The University of Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 IW-DPO: 1: Finish warmup phase. 2: Define t as ℓ_DPO (for IW-DPO-L) or r̂ (for IW-DPO-R). 3: Define the batch sizes N_Btr and N_Bv. 4: Define the number of training epochs E. 5: for e = 1 to E do 6: for batch B_tr = {(x^{tr,i}, y_1^{tr,i}, y_2^{tr,i}, b^{tr,i})}_{i=1}^{N_Btr} drawn i.i.d. from D_tr do 7: Sample batch B_v = {(x^{v,i}, y_1^{v,i}, y_2^{v,i}, b^{v,i})}_{i=1}^{N_Bv} i.i.d. from D_v. 8: Obtain Z_tr with respect to B_tr and Z_v with respect to B_v. 9: Estimate w with Z_tr and Z_v as inputs. 10: Obtain ŵ by normalizing w. 11: Obtain per-instance losses [ℓ_DPO^{tr,1}, ..., ℓ_DPO^{tr,N_Btr}]. 12: Obtain Ĵ by reweighting the per-instance losses with ŵ. 13: Compute the gradients with Ĵ. 14: Update the model parameters using the computed gradients. 15: end for 16: end for |
| Open Source Code | No | The text does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository. The paper mentions being "Reviewed on Open Review: https://openreview.net/forum?id=C7QWN4AXvp" but this is for peer review, not code release. |
| Open Datasets | Yes | We employ the Safe RLHF dataset, where each instance contains a question and a pair of responses. In addition to preference labels based on helpfulness, the Safe RLHF dataset (Dai et al., 2024; Ji et al., 2023) includes a safety label for each response... The SHP dataset (Ethayarajh et al., 2022) consists of questions and responses from 18 different domains... The CALI dataset (Huang & Yang, 2023) contains premises, hypotheses, and labels... The URLs are https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF for Safe RLHF, https://huggingface.co/datasets/stanfordnlp/SHP for SHP, ... for CALI |
| Dataset Splits | Yes | We further divide the Helpful-Harmless set into three sets: Helpful-Harmless training set, Helpful-Harmless validation set, and Helpful-Harmless test set. We then create the training dataset Dtr by combining the Helpful-Harmful set and the Helpful-Harmless training set. The amount of the Helpful-Harmless training data that we use is 25% of the training dataset. While the Helpful-Harmless validation set is used as the validation dataset Dv, the Helpful-Harmless test set is used as Dte for evaluation. Dv is fifty times smaller than Dtr. Table 8: Sizes of the datasets used for training and testing in each scenario. |
| Hardware Specification | No | The paper discusses various language models (e.g., Llama 3.1-8B-Instruct, Pythia-1.4B, Gemma 2-9B, Pythia-2.8B, Gemma 2-2B) used in experiments, but it does not specify the underlying hardware (e.g., GPU models, CPU types, or memory) used to conduct these experiments. |
| Software Dependencies | No | The paper does not explicitly state any specific software dependencies with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | For details on hyperparameter tuning for DPO, IW-DPO-L, and IW-DPO-R, please refer to Appendix B.1. See Appendix B.2 for the number of instances for the training, validation, and test sets. Table 7: Default hyperparameter settings: β (for Eq. (5)): 0.1 for DPO, IW-DPO-L, and IW-DPO-R; λ (for Eq. (9)): 0.1 for IW-DPO-L and IW-DPO-R; γ (for RBF): 0.1 for IW-DPO-L and IW-DPO-R; warmup_examples: 1024 for IW-DPO-L and IW-DPO-R. |
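The inner loop of the Algorithm 1 excerpt quoted above (estimate importance weights from training and validation features, normalize, reweight the per-instance DPO losses, update) can be sketched as a single training step. This is a minimal sketch, not the authors' released implementation: `featurize`, `estimate_weights`, and `per_instance_dpo_loss` are hypothetical stand-ins for the paper's feature extractor, weight estimator (e.g., the RBF-kernel-based estimator the γ hyperparameter suggests), and per-instance DPO loss.

```python
import torch


def importance_weighted_step(model, optimizer, train_batch, val_batch,
                             featurize, estimate_weights,
                             per_instance_dpo_loss):
    """One IW-DPO-style training step, sketched after Algorithm 1.

    All three callables are assumptions standing in for components the
    paper defines elsewhere; only the reweighting structure is taken
    from the quoted pseudocode.
    """
    z_tr = featurize(train_batch)            # Z_tr for the training batch
    z_v = featurize(val_batch)               # Z_v for the validation batch
    w = estimate_weights(z_tr, z_v)          # importance weights w
    w_hat = w / w.sum()                      # normalized weights w-hat
    losses = per_instance_dpo_loss(model, train_batch)  # per-instance DPO losses
    j_hat = (w_hat * losses).sum()           # reweighted objective J-hat
    optimizer.zero_grad()
    j_hat.backward()                         # gradients of J-hat
    optimizer.step()                         # parameter update
    return j_hat.item()
```

The normalization step (line 10 of the pseudocode) keeps the effective learning rate comparable across batches regardless of the raw weight scale.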