RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment
Authors: Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks (harmlessness, helpfulness, and story outline generation) and when using both 7B and 30B model scales for simulating preference data. |
| Researcher Affiliation | Collaboration | Kevin Yang (Meta AI, UC Berkeley), Dan Klein (UC Berkeley), Asli Celikyilmaz (Meta AI), Nanyun Peng (UCLA), Yuandong Tian (Meta AI) |
| Pseudocode | No | The paper describes the method and fine-tuning procedure in text but does not include any pseudocode or algorithm blocks with labels like 'Algorithm' or 'Pseudocode'. |
| Open Source Code | Yes | Code and simulated preference data are available at https://github.com/facebookresearch/rlcd. |
| Open Datasets | Yes | Our harmlessness and helpfulness prompt sets are inspired by Bai et al. (2022a), and we use their training sets to derive the initial prompts for preference data simulation; each training set contains slightly over 40000 conversations. |
| Dataset Splits | Yes | For harmlessness and helpfulness, the validation set is the first 1000 examples from Anthropic's test data (e.g., https://github.com/anthropics/hh-rlhf/blob/master/harmless-base/test.jsonl.gz) and the test set is the second 1000 examples. (A minimal split sketch follows the table.) |
| Hardware Specification | No | The paper mentions using LLaMA-7B and LLaMA-30B models and that they were loaded in 8-bit precision, but it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper states 'Our implementation is based on the Alpaca Farm codebase (Dubois et al., 2023)' and mentions 'PPO' but does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | We optimize the training parameters for PPO, in particular the number of training steps and KL-regularization term. ... for all three tasks, we selected KL coefficients from among {0.001, 0.002, 0.004, 0.008, 0.016, 0.032} and a number of PPO steps from among {20, 40, 60, 80} using a grid search. (A minimal grid-search sketch follows the table.) |
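The dataset split reported above (first 1000 examples for validation, second 1000 for test, taken from Anthropic's hh-rlhf test files) is straightforward to reproduce. Below is a minimal sketch, not the authors' code: the local file path and the `load_split` helper name are assumptions, and the file layout follows the hh-rlhf repository (one JSON object per line, gzip-compressed).

```python
# Minimal sketch (assumption: the hh-rlhf test file has been downloaded locally).
# Derives the split described above: first 1000 examples -> validation,
# second 1000 examples -> test.
import gzip
import json

def load_split(path="harmless-base/test.jsonl.gz"):
    """Split Anthropic's test file into validation (first 1000) and test (second 1000)."""
    examples = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            examples.append(json.loads(line))
    return examples[:1000], examples[1000:2000]
```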
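The reported hyperparameter search is a plain grid over the quoted KL coefficients and PPO step counts. The sketch below illustrates that loop under stated assumptions: `train_ppo` and `evaluate` are hypothetical stand-ins for the Alpaca Farm-based PPO training and validation-set scoring, not functions from the paper's released codebase.

```python
# Minimal grid-search sketch over the hyperparameter ranges quoted above.
# `train_ppo` and `evaluate` are caller-supplied callables (hypothetical here).
from itertools import product

KL_COEFFICIENTS = [0.001, 0.002, 0.004, 0.008, 0.016, 0.032]
PPO_STEPS = [20, 40, 60, 80]

def grid_search(train_ppo, evaluate):
    """Return the (kl_coef, n_steps) pair with the best validation score."""
    best_config, best_score = None, float("-inf")
    for kl_coef, n_steps in product(KL_COEFFICIENTS, PPO_STEPS):
        policy = train_ppo(kl_coef=kl_coef, num_steps=n_steps)
        score = evaluate(policy)
        if score > best_score:
            best_config, best_score = (kl_coef, n_steps), score
    return best_config, best_score
```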