RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment

Authors: Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks (harmlessness, helpfulness, and story outline generation) and when using both 7B and 30B model scales for simulating preference data. |
| Researcher Affiliation | Collaboration | Kevin Yang (Meta AI, UC Berkeley), Dan Klein (UC Berkeley), Asli Celikyilmaz (Meta AI), Nanyun Peng (UCLA), Yuandong Tian (Meta AI) |
| Pseudocode | No | The paper describes the method and fine-tuning procedure in text but does not include any pseudocode or algorithm blocks labeled 'Algorithm' or 'Pseudocode'. |
| Open Source Code | Yes | Code and simulated preference data are available at https://github.com/facebookresearch/rlcd. |
| Open Datasets | Yes | Our harmlessness and helpfulness prompt sets are inspired by Bai et al. (2022a), and we use their training sets to derive the initial prompts for preference data simulation; each training set contains slightly over 40,000 conversations. |
| Dataset Splits | Yes | For harmlessness and helpfulness, the validation set is the first 1000 examples from Anthropic's test data (e.g., https://github.com/anthropics/hh-rlhf/blob/master/harmless-base/test.jsonl.gz) and the test set is the second 1000 examples. (See the split-loading sketch after the table.) |
| Hardware Specification | No | The paper mentions using LLaMA-7B and LLaMA-30B models loaded in 8-bit precision, but it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used for running the experiments. (An illustrative 8-bit loading sketch follows the table.) |
| Software Dependencies | No | The paper states 'Our implementation is based on the AlpacaFarm codebase (Dubois et al., 2023)' and mentions PPO, but it does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | We optimize the training parameters for PPO, in particular the number of training steps and the KL-regularization term. ... for all three tasks, we selected KL coefficients from among {0.001, 0.002, 0.004, 0.008, 0.016, 0.032} and a number of PPO steps from among {20, 40, 60, 80} using a grid search. (See the grid-search sketch after the table.) |
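
As a concrete reading of the Dataset Splits row, the sketch below loads Anthropic's harmless-base test file and carves out the validation and test sets described in the paper. It assumes the referenced test.jsonl.gz has already been downloaded locally; the file path and the `load_splits` helper name are illustrative, not part of the released code.

```python
import gzip
import json

def load_splits(path="harmless-base/test.jsonl.gz"):
    """Split Anthropic's hh-rlhf test file as described in the paper:
    first 1000 examples -> validation, second 1000 -> test."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    validation = examples[:1000]
    test = examples[1000:2000]
    return validation, test

if __name__ == "__main__":
    val, test = load_splits()
    print(f"validation: {len(val)} examples, test: {len(test)} examples")
```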
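
The Hardware Specification row notes that the LLaMA models were loaded in 8-bit precision but gives no further hardware details. Below is a minimal sketch of 8-bit loading with Hugging Face transformers and bitsandbytes; the model identifier and the choice of these particular libraries are assumptions, since the paper does not state its exact loading code or GPU configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical checkpoint identifier; the paper only says LLaMA-7B / LLaMA-30B.
model_name = "huggyllama/llama-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",  # place layers on whatever GPUs are available
)
```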
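
The Experiment Setup row reports a grid search over KL coefficients and PPO step counts. The sketch below enumerates exactly that grid and keeps the best-scoring configuration; the `train_and_evaluate` callable and the selection criterion are placeholders, since the paper does not publish the selection script itself.

```python
from itertools import product

# Grid reported in the paper's experiment setup.
KL_COEFFICIENTS = [0.001, 0.002, 0.004, 0.008, 0.016, 0.032]
PPO_STEPS = [20, 40, 60, 80]

def grid_search(train_and_evaluate):
    """Try every (KL coefficient, PPO steps) pair and return the best one.
    `train_and_evaluate` is a hypothetical callable that runs PPO with the
    given settings and returns a validation score."""
    best_score, best_config = float("-inf"), None
    for kl_coef, n_steps in product(KL_COEFFICIENTS, PPO_STEPS):
        score = train_and_evaluate(kl_coef=kl_coef, ppo_steps=n_steps)
        if score > best_score:
            best_score, best_config = score, (kl_coef, n_steps)
    return best_config, best_score
```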