Provably Robust DPO: Aligning Language Models with Noisy Feedback

Authors: Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
Researcher Affiliation | Industry | Microsoft Research, India.
Pseudocode | Yes | PyTorch code for the Robust DPO loss is provided in the paper (see the sketch after the table).
Open Source Code | Yes | All artifacts are made available at https://aka.ms/RobustDPO.
Open Datasets | Yes | Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners.
Dataset Splits | No | The paper mentions a training set and an evaluation set: "This resulted in a dataset with 12000 preference triplets of which 10000 were used to train the policy, and 2000 for evaluation." However, it does not explicitly define a separate 'validation' split or its size/purpose.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions "PyTorch code" but does not list specific software dependencies or version numbers.
Experiment Setup | Yes | For methods in the DPO family (vanilla DPO, rDPO, cDPO), we optimized the policy for 1000 steps with batch size 16. Table 4, hyperparameters used for methods in the DPO family: beta 0.1, learning rate 0.001, batch size 16, max length 512, max prompt length 128. Table 5, hyperparameters used for methods in the PPO family: reward model learning rate 1.41 x 10^-5, batch size 16; PPO learning rate 1.41 x 10^-5, batch size 16.
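
The paper's own PyTorch listing for the Robust DPO loss is not reproduced in this report. The following is a minimal sketch of the debiased loss the paper describes, assuming a known label-flip rate eps < 0.5; the function and tensor names (policy/reference chosen and rejected log-probabilities) are illustrative, not the authors' code.

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
        logits = beta * ((policy_chosen_logps - policy_rejected_logps)
                         - (ref_chosen_logps - ref_rejected_logps))
        return -F.logsigmoid(logits)

    def rdpo_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1, eps=0.1):
        # Debiased (robust) DPO loss when each preference label is flipped
        # with probability eps < 0.5: reweight the loss on the observed
        # preference and the loss on the flipped preference so the estimator
        # is unbiased in expectation over the label noise.
        loss_observed = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                                 ref_chosen_logps, ref_rejected_logps, beta)
        loss_flipped = dpo_loss(policy_rejected_logps, policy_chosen_logps,
                                ref_rejected_logps, ref_chosen_logps, beta)
        return ((1.0 - eps) * loss_observed - eps * loss_flipped) / (1.0 - 2.0 * eps)

Setting eps = 0 recovers the vanilla DPO loss, which matches the paper's framing of rDPO as a noise-corrected generalization.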
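For reference, the reported hyperparameters from Tables 4 and 5 can be collected into a plain Python configuration. The dictionary and key names below are illustrative and not taken from the authors' code.

    # Hyperparameters reported for the DPO family (vanilla DPO, rDPO, cDPO).
    dpo_family_config = {
        "beta": 0.1,               # KL-regularization strength in the DPO objective
        "learning_rate": 1e-3,
        "batch_size": 16,
        "max_length": 512,         # maximum total sequence length
        "max_prompt_length": 128,  # maximum prompt length
        "train_steps": 1000,       # policy optimized for 1000 steps
    }

    # Hyperparameters reported for the PPO family.
    ppo_family_config = {
        "reward_model": {"learning_rate": 1.41e-5, "batch_size": 16},
        "ppo":          {"learning_rate": 1.41e-5, "batch_size": 16},
    }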