Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

Authors: Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate that applying DPL to RLHF for LLM chatbots identifies hidden context in the data and significantly reduces subsequent jailbreak vulnerability. We validate DPL in two ways. First, we conduct a small-scale synthetic experiment with a 1-dimensional space of alternatives that allows us to directly compare to Borda count. Next, we apply DPL to a real-world dataset of preferences for use in RLHF. (See the DPL sketch after this table.)
Researcher Affiliation | Academia | Anand Siththaranjan and Cassidy Laidlaw, University of California, Berkeley ({anandsranjan,cassidy_laidlaw}@cs.berkeley.edu); Dylan Hadfield-Menell, Massachusetts Institute of Technology (dhm@csail.mit.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/cassidylaidlaw/hidden-context.
Open Datasets | Yes | In order to evaluate the ability of DPL methods to identify hidden context, we use the HH-RLHF dataset (Bai et al., 2022a). (See the loading sketch after this table.)
Dataset Splits | No | The paper uses the HH-RLHF dataset and mentions training models for certain epochs and batch sizes, but does not provide specific train/validation/test splits (e.g., percentages or counts) or a detailed splitting methodology.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments; it only names the base LLM used (LLAMA-2-7B).
Software Dependencies | No | We implement training using PyTorch (Paszke et al., 2019) and Hugging Face Transformers (Wolf et al., 2020). While specific software is mentioned, version numbers for these libraries are not provided in the text, only references to their original papers.
Experiment Setup | Yes | We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 3 × 10^-6 which is decayed via a cosine schedule to 3 × 10^-7, a batch size of 2 comparisons (i.e., 4 responses total), and weight decay of 0.0001. Preference models trained on just the harmlessness or helpfulness subsets of the data are trained for 2 epochs, while preference models trained on the combined data are trained for 1 epoch; this ensures all models are trained for roughly the same number of gradient steps. (See the configuration sketch after this table.)
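
To make the DPL technique referenced in the Research Type row concrete, here is a minimal sketch of a mean-and-variance style distributional preference loss: the reward head predicts a mean and variance per response, and the preference likelihood treats the two rewards as independent Gaussians. The class and function names, the pooled-embedding input, and the exact parameterization are illustrative assumptions, not code from the authors' repository.

```python
import torch
import torch.nn as nn

class MeanVarianceRewardHead(nn.Module):
    """Illustrative DPL-style head: predicts a reward mean and log-variance
    from a pooled response embedding (hypothetical interface)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mean = nn.Linear(hidden_size, 1)
        self.log_var = nn.Linear(hidden_size, 1)

    def forward(self, embedding: torch.Tensor):
        # embedding: (batch, hidden_size) pooled representation of a response
        return self.mean(embedding).squeeze(-1), self.log_var(embedding).squeeze(-1)

def dpl_preference_loss(mu_chosen, log_var_chosen, mu_rejected, log_var_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one,
    treating the two rewards as independent Gaussians:
        P(chosen > rejected) = Phi((mu_c - mu_r) / sqrt(var_c + var_r))
    """
    var = log_var_chosen.exp() + log_var_rejected.exp()
    z = (mu_chosen - mu_rejected) / torch.sqrt(var + 1e-8)
    p = torch.special.ndtr(z)  # standard normal CDF
    return -torch.log(p.clamp(min=1e-8)).mean()
```

The predicted variance is what lets a model of this kind flag comparisons whose outcomes look inconsistent because of hidden context, for example annotators weighing helpfulness against harmlessness differently.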
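The HH-RLHF data cited in the Open Datasets row is distributed through the Hugging Face Hub. A typical way to load its harmlessness and helpfulness subsets is sketched below, assuming the public Anthropic/hh-rlhf layout; the authors' repository may apply its own preprocessing on top of this.

```python
from datasets import load_dataset

# Each example contains "chosen" and "rejected" conversation strings.
harmless = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")
helpful = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")

print(harmless["train"][0]["chosen"][:200])
```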
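The Experiment Setup row quotes concrete hyperparameters. The following is a minimal sketch of how those settings could be wired up in PyTorch; the dummy model and step count are placeholders, and this illustrates the quoted configuration rather than reproducing the authors' training script.

```python
import torch
import torch.nn as nn

# Dummy stand-in for the LLaMA-2-7B-based preference model used in the paper.
model = nn.Linear(16, 1)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-6,            # quoted initial learning rate
    weight_decay=1e-4,  # quoted weight decay of 0.0001
)

num_training_steps = 1_000  # placeholder; in the paper this follows from epochs x dataset size
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_training_steps, eta_min=3e-7  # cosine decay to the quoted 3e-7 floor
)

for step in range(num_training_steps):
    # A batch of 2 comparisons corresponds to 4 responses total.
    batch = torch.randn(4, 16)           # placeholder features
    loss = model(batch).pow(2).mean()    # placeholder loss; a real run would use a preference loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

CosineAnnealingLR with eta_min=3e-7 is used here because the quoted schedule decays to 3 × 10^-7 rather than to zero.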