Statistical Rejection Sampling Improves Preference Optimization
Authors: Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J Liu, Jialu Liu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments across diverse tasks, we demonstrate that RSO consistently outperforms both SLiC and DPO as evaluated by gold reward, Large Language Models (LLMs) and human raters. |
| Researcher Affiliation | Industry | Google Research, Google DeepMind |
| Pseudocode | Yes | Algorithm 1 Statistical Rejection Sampling Algorithm in Python (a hedged sketch of this algorithm follows the table). |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We study RSO on Reddit TL;DR summarization (Stiennon et al., 2020) and Anthropic HH dialogue (Bai et al., 2022) datasets. |
| Dataset Splits | Yes | The Reddit TL;DR summarization dataset contains both finetune data D^tldr_sft and human feedback data D^tldr_hf. D^tldr_sft contains 117k/6k/6k examples in train, validation and test splits. |
| Hardware Specification | No | The paper mentions models like 'T5-large (770M)' and 'T5-XXL (11B)' but does not specify the exact hardware (e.g., specific GPU models, CPU types, or TPU versions) used for experiments. |
| Software Dependencies | No | The paper mentions specific models (T5-large, T5-XXL) and an optimizer (Adafactor) and provides a Python implementation of an algorithm, but it does not list specific software dependencies with their version numbers (e.g., Python version, library versions like PyTorch, TensorFlow, or scikit-learn). |
| Experiment Setup | Yes | Unless specifically mentioned, we set β = 0.5 and γ = 0.05. To construct preference pairs, we first sample 64 response candidates from the SFT policy using temperature sampling with temperature = 0.7 and top-k = 40. Then we sub-sample 8 samples. We use batch size 32 and learning rate 1e-5 with the Adafactor optimizer (Shazeer & Stern, 2018). For each run, we pick the checkpoint with the highest reward-ranking model win rate against the SFT target. (A usage sketch with these settings follows the table.) |
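For readers who want the mechanics behind the pseudocode row: RSO targets the optimal policy π*(y|x) ∝ π_sft(y|x) · exp(r(y|x)/β) and approximates it by rejection sampling from the SFT policy, accepting a candidate y with probability exp((r(y) − r_max)/β), where r_max is the largest reward among the candidates still in the pool. The Python below is a minimal sketch of that acceptance rule, not the paper's verbatim Algorithm 1; the function name `conduct_rejection_sampling` and the dict-based pool bookkeeping are illustrative assumptions.

```python
import numpy as np

def conduct_rejection_sampling(response_candidates, response_rewards,
                               num_samples, beta):
    """Sub-samples candidates so they approximate the KL-constrained optimal policy.

    Acceptance rule: accept candidate y with probability
    exp((r(y) - r_max) / beta), where r_max is the maximum reward among the
    candidates still in the pool. Candidates are assumed to be distinct strings.
    """
    candidates = dict(zip(response_candidates, response_rewards))
    accepted = []
    while len(accepted) < num_samples and candidates:
        max_reward = max(candidates.values())
        to_remove = []
        for c, r in candidates.items():
            # exp((r - max_reward) / beta) <= 1, so it is a valid acceptance probability.
            if np.random.uniform() < np.exp((r - max_reward) / beta):
                accepted.append(c)
                to_remove.append(c)
                if len(accepted) == num_samples:
                    break
        # Remove accepted candidates so they are not drawn again.
        for c in to_remove:
            candidates.pop(c)
    return accepted
```

Because the highest-reward candidate in the pool is accepted with probability 1, each pass through the pool accepts at least one candidate, so the loop terminates.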
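And a usage sketch with the reported settings (64 candidates per prompt, 8 sub-sampled, β = 0.5). The candidate strings and reward scores below are random placeholders standing in for the SFT policy's temperature = 0.7, top-k = 40 samples and the reward-ranking model's scores, neither of which the paper releases as code.

```python
import numpy as np

# Placeholder candidates and rewards: in the paper these come from the SFT
# policy (temperature = 0.7, top-k = 40) and the trained reward-ranking model.
rng = np.random.default_rng(0)
candidates = [f"response_{i}" for i in range(64)]
rewards = rng.normal(size=64).tolist()

# Sub-sample 8 of the 64 candidates with beta = 0.5, matching the reported setup.
subset = conduct_rejection_sampling(candidates, rewards, num_samples=8, beta=0.5)
print(subset)  # 8 responses biased toward higher reward, used to build preference pairs
```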