RLCD: Reinforcement Learning from Contrastive Distillation for LM Alignment

Authors: Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks (harmlessness, helpfulness, and story outline generation) and when using both 7B and 30B model scales for simulating preference data. |
| Researcher Affiliation | Collaboration | Kevin Yang (Meta AI, UC Berkeley), Dan Klein (UC Berkeley), Asli Celikyilmaz (Meta AI), Nanyun Peng (UCLA), Yuandong Tian (Meta AI) |
| Pseudocode | No | The paper describes the method and fine-tuning procedure in text but does not include any pseudocode or algorithm blocks labeled 'Algorithm' or 'Pseudocode'. |
| Open Source Code | Yes | Code and simulated preference data are available at https://github.com/facebookresearch/rlcd. |
| Open Datasets | Yes | Our harmlessness and helpfulness prompt sets are inspired by Bai et al. (2022a), and we use their training sets to derive the initial prompts for preference data simulation; each training set contains slightly over 40,000 conversations. |
| Dataset Splits | Yes | For harmlessness and helpfulness, the validation set is the first 1000 examples from Anthropic's test data (e.g., https://github.com/anthropics/hh-rlhf/blob/master/harmless-base/test.jsonl.gz) and the test set is the second 1000 examples. (See the split-loading sketch after the table.) |
| Hardware Specification | No | The paper mentions using LLaMA-7B and LLaMA-30B models loaded in 8-bit precision, but it does not specify the underlying hardware (e.g., GPU models, CPU types, memory) used for running the experiments. (An illustrative 8-bit loading sketch follows the table.) |
| Software Dependencies | No | The paper states 'Our implementation is based on the AlpacaFarm codebase (Dubois et al., 2023)' and mentions PPO, but it does not provide specific version numbers for any software, libraries, or frameworks used. |
| Experiment Setup | Yes | We optimize the training parameters for PPO, in particular the number of training steps and the KL-regularization term. ... for all three tasks, we selected KL coefficients from among {0.001, 0.002, 0.004, 0.008, 0.016, 0.032} and a number of PPO steps from among {20, 40, 60, 80} using a grid search. (See the grid-search sketch after the table.) |
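
As a concrete reading of the Dataset Splits row, the sketch below loads Anthropic's harmless-base test file and carves out the validation and test sets described in the paper. It assumes the referenced test.jsonl.gz has already been downloaded locally; the file path and the `load_splits` helper name are illustrative, not part of the released code.

```python
import gzip
import json

def load_splits(path="harmless-base/test.jsonl.gz"):
    """Split Anthropic's hh-rlhf test file as described in the paper:
    first 1000 examples -> validation, second 1000 -> test."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        examples = [json.loads(line) for line in f]
    validation = examples[:1000]
    test = examples[1000:2000]
    return validation, test

if __name__ == "__main__":
    val, test = load_splits()
    print(f"validation: {len(val)} examples, test: {len(test)} examples")
```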
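
The Hardware Specification row notes that the LLaMA models were loaded in 8-bit precision but gives no further hardware details. Below is a minimal sketch of 8-bit loading with Hugging Face transformers and bitsandbytes; the model identifier and the choice of these particular libraries are assumptions, since the paper does not state its exact loading code or GPU configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical checkpoint identifier; the paper only says LLaMA-7B / LLaMA-30B.
model_name = "huggyllama/llama-7b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",  # place layers on whatever GPUs are available
)
```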
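
The Experiment Setup row reports a grid search over KL coefficients and PPO step counts. The sketch below enumerates exactly that grid and keeps the best-scoring configuration; the `train_and_evaluate` callable and the selection criterion are placeholders, since the paper does not publish the selection script itself.

```python
from itertools import product

# Grid reported in the paper's experiment setup.
KL_COEFFICIENTS = [0.001, 0.002, 0.004, 0.008, 0.016, 0.032]
PPO_STEPS = [20, 40, 60, 80]

def grid_search(train_and_evaluate):
    """Try every (KL coefficient, PPO steps) pair and return the best one.
    `train_and_evaluate` is a hypothetical callable that runs PPO with the
    given settings and returns a validation score."""
    best_score, best_config = float("-inf"), None
    for kl_coef, n_steps in product(KL_COEFFICIENTS, PPO_STEPS):
        score = train_and_evaluate(kl_coef=kl_coef, ppo_steps=n_steps)
        if score > best_score:
            best_score, best_config = score, (kl_coef, n_steps)
    return best_config, best_score
```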