Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization
Authors: Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift. |
| Researcher Affiliation | Collaboration | ¹Texas A&M University, ²Tencent AI Lab, ³Google DeepMind. Emails: EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: WDPO Algorithm; Algorithm 2: KLDPO Algorithm |
| Open Source Code | Yes | We provide the code at https://github.com/TheBlackCat22/distributionally_robust_dpo. |
| Open Datasets | Yes | Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift. ...fine-tune LLaMA-3.2-1B/3B-Instruct and LLaMA-3.1-8B-Instruct models on prompts from the HelpSteer2 dataset (Wang et al., 2024b) using preferences generated by the ArmoRM reward model (Wang et al., 2024a), and evaluate them on distinct reward objectives from the Open LLM Leaderboard (Fourrier et al., 2024). We use the Emotion dataset (Saravia et al., 2018) to train a GPT-2 model (Radford et al., 2019) with a classification head for multi-label classification over five emotions: sadness, joy, love, anger, fear. |
| Dataset Splits | Yes | All models are trained on preference labels emphasizing the emotion fear, while evaluation preferences gradually shift toward anger. The left two plots correspond to convex mixing of these emotions, and the right two use geometric mixing. As expected, DPO performs best when the evaluation preference closely matches the training setup. However, as the evaluation shifts toward anger, DPO's performance degrades significantly. To simulate preference shift, evaluation is performed at mixing coefficients α = αo, where αo = 0.1 is used during training. We evaluate all models on five individual ArmoRM objectives, three of which are unseen during training, to simulate preference shift. |
| Hardware Specification | Yes | Experiments were conducted on a single 40 GB A100 GPU, requiring gradient accumulation over two steps. Experiments were conducted on an 8× H100 GPU setup, requiring loading the model in bfloat16 and training with DeepSpeed ZeRO-2 optimizer (Rajbhandari et al., 2020). |
| Software Dependencies | No | The model was trained using a sigmoid activation function and binary cross-entropy loss, adhering to the standard multilabel classification framework. Training was conducted over 8 epochs with a batch size of 64, utilizing the Adam optimizer with a learning rate of 5.0 × 10⁻⁵ and a weight decay of 0.01. The training used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5.0 × 10⁻⁷ following 12 warmup steps. Additionally, a maximum gradient norm of 10 was applied to stabilize the training. ...training with DeepSpeed ZeRO-2 optimizer (Rajbhandari et al., 2020). |
| Experiment Setup | Yes | Training was conducted over 8 epochs with a batch size of 64, utilizing the Adam optimizer with a learning rate of 5.0 × 10⁻⁵ and a weight decay of 0.01. The model was trained for 10 epochs with a batch size of 64. The training used the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5.0 × 10⁻⁷ following 12 warmup steps. Additionally, a maximum gradient norm of 10 was applied to stabilize the training. The model was trained for 40 epochs with an effective batch size of 64. We used Adam optimizer, with a learning rate of 5.0 × 10⁻⁷ following 12 warmup steps. A maximum gradient norm of 10 was applied to ensure stable training. The DPO beta parameter was set to 0.1 for all training runs. The models were trained for 8 epochs with an effective batch size of 128. We used Adam optimizer with a learning rate of 5.0 × 10⁻⁷ after 10% warmup ratio and then the learning rate was reduced using a cosine scheduler. The DPO beta parameter was set to 0.01 for all training runs. |
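
The last configuration quoted in the Experiment Setup row (peak learning rate 5.0 × 10⁻⁷, 10% warmup ratio, cosine decay) can be sketched as a plain Python function. This is an illustrative sketch, not the authors' code: the function name `lr_at_step`, the linear shape of the warmup, and the decay toward zero are assumptions that the quoted text does not specify.

```python
import math

PEAK_LR = 5.0e-7      # learning rate quoted in the Experiment Setup row
WARMUP_RATIO = 0.10   # "10% warmup ratio" quoted in the Experiment Setup row


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    Assumptions not stated in the source: warmup is linear, and the
    cosine phase decays toward 0 by the end of training.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, starting from peak_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / decay_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 99, 100, 550, 999):
        print(s, lr_at_step(s, total))
```

With an effective batch size of 128, the total number of optimizer steps would be roughly 8 × (dataset size / 128); the value 1000 above is only a placeholder.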