Transforming and Combining Rewards for Aligning Large Language Models
Authors: Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alexander Nicholas D’Amour, Sanmi Koyejo, Victor Veitch
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach. |
| Researcher Affiliation | Collaboration | ¹University of Chicago, Chicago, IL, USA; ²Google Research, Mountain View, CA, USA; ³Google DeepMind, Mountain View, CA, USA; ⁴Stanford University, Stanford, CA, USA. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | Yes | We use the Anthropic Helpfulness and Harmlessness datasets (Bai et al., 2022). |
| Dataset Splits | Yes | For both tasks, we split the training set into two: half for training the reward model, and half for the alignment step. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments (e.g., GPU models, CPU types, or cloud instance specifications). |
| Software Dependencies | No | The paper mentions models such as T5-base and PaLM-2-XXS but does not list specific software dependencies (e.g., libraries or frameworks) with version numbers. |
| Experiment Setup | Yes | We use Proximal Policy Optimization (PPO) to perform RLHF alignment. The specific hyperparameters are in Table 1: policy learning rate 5 × 10⁻⁶; value learning rate 4 × 10⁻⁵; learning schedule constant (linear warm-up); training steps 20,000; warm-up steps 2,000; batch size 32; input length 1,024; output length 256. |
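
The Experiment Setup and Dataset Splits rows are concrete enough to capture in a small configuration sketch. The snippet below is a minimal illustration, not the authors' code: it records the Table 1 hyperparameters in a plain Python dataclass and mimics the 50/50 training-set split (half for the reward model, half for the alignment step). The names `PPOAlignmentConfig` and `split_for_reward_and_alignment` are assumptions, since the paper does not name a training framework or release source code.

```python
# Minimal sketch of the reported experiment setup, assuming plain Python.
# Values come from Table 1 of the paper; everything else is illustrative.
from dataclasses import dataclass
import random
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PPOAlignmentConfig:
    """Hyperparameters reported in Table 1 for the RLHF alignment step."""
    policy_learning_rate: float = 5e-6
    value_learning_rate: float = 4e-5
    learning_schedule: str = "constant (linear warm-up)"
    training_steps: int = 20_000
    warmup_steps: int = 2_000
    batch_size: int = 32
    input_length: int = 1_024
    output_length: int = 256


def split_for_reward_and_alignment(
    train_examples: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split the training set in two: one half for training the reward model,
    the other half for the alignment (PPO) step, as described in the paper."""
    shuffled = list(train_examples)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


if __name__ == "__main__":
    config = PPOAlignmentConfig()
    rm_half, align_half = split_for_reward_and_alignment(
        [f"example_{i}" for i in range(10)]
    )
    print(config)
    print(len(rm_half), "examples for the reward model;", len(align_half), "for alignment")
```

Because the paper describes the schedule only as constant with linear warm-up, the sketch stores it as a descriptive string rather than implementing it; an actual run would wire these values into whatever PPO implementation is used.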