Statistical Rejection Sampling Improves Preference Optimization

Authors: Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, Jialu Liu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments across diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO as evaluated by gold reward, Large Language Models (LLMs) and human raters.
Researcher Affiliation | Industry | Google Research, Google DeepMind
Pseudocode | Yes | Algorithm 1: Statistical Rejection Sampling Algorithm in Python (see the sketch after the table).
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository.
Open Datasets | Yes | We study RSO on Reddit TL;DR summarization (Stiennon et al., 2020) and Anthropic HH dialogue (Bai et al., 2022) datasets.
Dataset Splits | Yes | The Reddit TL;DR summarization dataset contains both finetune data D^tldr_sft and human feedback data D^tldr_hf. D^tldr_sft contains 117k/6k/6k examples in train, validation and test splits.
Hardware Specification | No | The paper mentions models like 'T5-large (770M)' and 'T5-XXL (11B)' but does not specify the exact hardware (e.g., specific GPU models, CPU types, or TPU versions) used for experiments.
Software Dependencies | No | The paper mentions specific models (T5-large, T5-XXL), an optimizer (Adafactor), and a Python implementation of an algorithm, but it does not list software dependencies with version numbers (e.g., Python version, or library versions such as PyTorch, TensorFlow, or scikit-learn).
Experiment Setup | Yes | Unless specifically mentioned, we set β = 0.5 and γ = 0.05. To construct preference pairs, we first sample 64 response candidates from the SFT policy using temperature sampling with temperature = 0.7 and top-k = 40. Then we sub-sample 8 samples. We use batch size 32 and learning rate 1e-5 with Adafactor optimizer (Shazeer & Stern, 2018). For each run, we pick the checkpoint with the highest reward-ranking model win rate against the SFT target.
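
For the Pseudocode row, the following is a minimal Python sketch in the spirit of the paper's Algorithm 1 (statistical rejection sampling over reward-scored response candidates). It assumes candidates have already been scored by a reward-ranking model; the function and variable names are illustrative, not taken from a released codebase.

```python
import numpy as np

def conduct_rejection_sampling(response_candidates, response_rewards,
                               num_samples, beta):
    """Accept candidates until `num_samples` responses are kept.

    On each pass over the remaining pool, candidate y is accepted with
    probability exp((r(y) - r_max) / beta), where r_max is the highest
    reward still in the pool; accepted candidates are removed (sampling
    without replacement).
    """
    candidates = dict(zip(response_candidates, response_rewards))
    accepted = []
    while len(accepted) < num_samples:
        max_reward = max(candidates.values())
        to_remove = []
        for c, r in candidates.items():
            # Acceptance probability is 1 for the current best candidate
            # and decays exponentially (scaled by beta) below it.
            if np.random.uniform() >= np.exp((r - max_reward) / beta):
                continue
            accepted.append(c)
            to_remove.append(c)
            if len(accepted) == num_samples:
                break
        for c in to_remove:
            candidates.pop(c)
    return accepted
```

Smaller β concentrates acceptance on the highest-reward candidates, while larger β accepts more broadly from the SFT samples.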
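
The Experiment Setup row can likewise be read as a small pipeline. The sketch below wires the reported decoding and sub-sampling settings together; `sft_policy.sample` and `reward_model.score` are assumed interfaces for illustration, not APIs from the paper, and only the hyperparameter values (64 candidates, temperature 0.7, top-k 40, 8 kept, β = 0.5) come from the reported setup.

```python
BETA = 0.5           # default reported unless specifically mentioned
NUM_CANDIDATES = 64  # responses sampled per prompt from the SFT policy
NUM_KEPT = 8         # responses retained to construct preference pairs

def build_preference_candidates(prompt, sft_policy, reward_model):
    # Temperature sampling with temperature = 0.7 and top-k = 40,
    # matching the decoding settings in the Experiment Setup row.
    candidates = sft_policy.sample(
        prompt,
        num_samples=NUM_CANDIDATES,
        temperature=0.7,
        top_k=40,
    )
    # Score each candidate with the reward-ranking model.
    rewards = [reward_model.score(prompt, c) for c in candidates]
    # Sub-sample 8 responses using the rejection-sampling sketch above.
    return conduct_rejection_sampling(candidates, rewards, NUM_KEPT, BETA)
```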