Provably Robust DPO: Aligning Language Models with Noisy Feedback
Authors: Sayak Ray Chowdhury, Anush Kini, Nagarajan Natarajan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners. |
| Researcher Affiliation | Industry | Microsoft Research, India. |
| Pseudocode | Yes | PyTorch code for the Robust DPO loss is provided below. A hedged sketch of such a loss is included after this table. |
| Open Source Code | Yes | All artifacts are made available at https://aka.ms/RobustDPO. |
| Open Datasets | Yes | Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO and other heuristics proposed by practitioners. |
| Dataset Splits | No | The paper mentions a training set and an evaluation set: "This resulted in a dataset with 12000 preference triplets of which 10000 were used to train the policy, and 2000 for evaluation." However, it does not explicitly define a separate 'validation' split or its size/purpose. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions "PyTorch code" but does not specify library versions or other software dependencies required to run the experiments. |
| Experiment Setup | Yes | For methods in the DPO family (vanilla DPO, rDPO, cDPO), we optimized the policy for 1000 steps with batch size 16. Table 4 (hyperparameters for the DPO family): beta = 0.1, learning rate = 0.001, batch size = 16, max length = 512, max prompt length = 128. Table 5 (hyperparameters for the PPO family): reward model learning rate = 1.41 × 10⁻⁵, batch size = 16; PPO learning rate = 1.41 × 10⁻⁵, batch size = 16. A configuration sketch restating these values appears after the table. |
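The Pseudocode row above notes that the paper ships PyTorch code for the Robust DPO loss; that snippet is not reproduced in this report. The sketch below illustrates a debiased DPO loss of the kind the paper describes: the DPO loss on the observed preference order and the loss on the flipped order are combined with weights (1 − ε) and −ε, then rescaled by 1/(1 − 2ε) for an assumed label-flip rate ε < 0.5. Function and argument names, and the default ε, are assumptions for illustration, not the paper's exact code.

```python
import torch.nn.functional as F


def dpo_logits(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Per-example DPO logit: beta * (policy log-ratio minus reference log-ratio)."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return beta * (pi_logratios - ref_logratios)


def rdpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              beta=0.1, eps=0.1):
    """Noise-robust DPO loss for preference labels flipped with rate eps < 0.5."""
    logits = dpo_logits(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta)
    loss_observed = -F.logsigmoid(logits)   # standard DPO loss on the observed order
    loss_flipped = -F.logsigmoid(-logits)   # DPO loss if the preference pair were flipped
    # Debiased combination of the two terms under label-flip noise of rate eps.
    losses = ((1 - eps) * loss_observed - eps * loss_flipped) / (1 - 2 * eps)
    return losses.mean()
```

In this sketch, setting eps = 0 recovers the vanilla DPO loss, so the correction only kicks in when a nonzero flip rate is assumed.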
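For readers who want to restate the Experiment Setup values in one place, the hyperparameters from Tables 4 and 5 can be collected into a plain configuration sketch; the dictionary keys below are hypothetical names chosen for readability, not identifiers from the paper's code.

```python
# Hypothetical restatement of Tables 4 and 5; key names are assumptions.
dpo_family_config = {   # vanilla DPO, rDPO, cDPO (Table 4), trained for 1000 steps
    "beta": 0.1,
    "learning_rate": 1e-3,
    "batch_size": 16,
    "max_length": 512,
    "max_prompt_length": 128,
    "num_training_steps": 1000,
}

ppo_family_config = {   # PPO-family methods (Table 5)
    "reward_model": {"learning_rate": 1.41e-5, "batch_size": 16},
    "ppo": {"learning_rate": 1.41e-5, "batch_size": 16},
}
```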