Provably Robust DPO: Aligning Language Models with Noisy Feedback

Authors: Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
Researcher Affiliation | Industry | Microsoft Research, India.
Pseudocode | Yes | PyTorch code for the Robust DPO loss is provided in the paper (see the sketch after the table).
Open Source Code | Yes | All artifacts are made available at https://aka.ms/RobustDPO.
Open Datasets | Yes | Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
Dataset Splits | No | The paper mentions a training set and an evaluation set: "This resulted in a dataset with 12000 preference triplets of which 10000 were used to train the policy, and 2000 for evaluation." However, it does not explicitly define a separate 'validation' split or its size/purpose.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions "PyTorch code" but does not list specific software dependencies or version numbers.
Experiment Setup | Yes | For methods in the DPO family (vanilla DPO, rDPO, cDPO), we optimized the policy for 1000 steps with batch size 16. Table 4, hyperparameters used for methods in the DPO family: beta 0.1, learning rate 0.001, batch size 16, max length 512, max prompt length 128. Table 5, hyperparameters used for methods in the PPO family: reward model learning rate 1.41 x 10^-5, batch size 16; PPO learning rate 1.41 x 10^-5, batch size 16.
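
The paper's own PyTorch listing for the Robust DPO loss is not reproduced in this report. The following is a minimal sketch of the debiased loss the paper describes, assuming a known label-flip rate eps < 0.5; the function and tensor names (policy/reference chosen and rejected log-probabilities) are illustrative, not the authors' code.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
        logits = beta * ((policy_chosen_logps - policy_rejected_logps)
                         - (ref_chosen_logps - ref_rejected_logps))
        return -F.logsigmoid(logits)

    def rdpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1, eps=0.1):
        # Debiased (robust) DPO loss when each preference label is flipped
        # with probability eps < 0.5: reweight the loss on the observed
        # preference and the loss on the flipped preference so the estimator
        # is unbiased in expectation over the label noise.
        loss_observed = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                 ref_chosen_logps, ref_rejected_logps, beta)
        loss_flipped = dpo_loss(policy_rejected_logps, policy_chosen_logps,
                                ref_rejected_logps, ref_chosen_logps, beta)
        return ((1.0 - eps) * loss_observed - eps * loss_flipped) / (1.0 - 2.0 * eps)

Setting eps = 0 recovers the vanilla DPO loss, which matches the paper's framing of rDPO as a noise-corrected generalization.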
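For reference, the reported hyperparameters from Tables 4 and 5 can be collected into a plain Python configuration. The dictionary and key names below are illustrative and not taken from the authors' code.

    # Hyperparameters reported for the DPO family (vanilla DPO, rDPO, cDPO).
    dpo_family_config = {
        "beta": 0.1,               # KL-regularization strength in the DPO objective
        "learning_rate": 1e-3,
        "batch_size": 16,
        "max_length": 512,         # maximum total sequence length
        "max_prompt_length": 128,  # maximum prompt length
        "train_steps": 1000,       # policy optimized for 1000 steps
    }

    # Hyperparameters reported for the PPO family.
    ppo_family_config = {
        "reward_model": {"learning_rate": 1.41e-5, "batch_size": 16},
        "ppo":          {"learning_rate": 1.41e-5, "batch_size": 16},
    }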