ARGS: Alignment as Reward-Guided Search

Authors: Maxim Khanov, Jirayu Burapacheep, Yixuan Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents empirical experiments to evaluate the effectiveness of our proposed method. In particular, we aim to show that ARGS can effectively guide the outputs of the neural language model in alignment with human preference, such as helpfulness and harmlessness. All of our experiments are based on open-sourced language models and datasets.
Researcher Affiliation | Academia | Maxim Khanov^1*, Jirayu Burapacheep^2*, Yixuan Li^1; ^1 University of Wisconsin-Madison, ^2 Stanford University; mkhanov@wisc.edu, jirayu@stanford.edu, sharonli@cs.wisc.edu
Pseudocode | Yes |
Algorithm 1: ARGS-greedy
Input: Previous context x with n tokens, number of candidates k, reward coefficient w, desired number of tokens m, base model LM, and reward model r
Output: A generated sequence with m tokens
 1: for t = n to m - 1 do
 2:     V^(k) ← top-k tokens with highest likelihood
 3:     for v ∈ V^(k) do                          ▷ Iterate over top-k candidates
 4:         reward ← r([x, v])                    ▷ Compute a reward of this candidate
 5:         score(v) ← LM(v | x) + w · reward
 6:     end for
 7:     v_selected ← argmax_{v ∈ V^(k)} score(v)  ▷ Select token
 8:     x ← [x, v_selected]
 9: end for
10: return x
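As a concrete reference, below is a minimal PyTorch sketch of the ARGS-greedy loop in Algorithm 1. It assumes a Hugging Face-style causal language model (whose forward pass exposes next-token logits) and a reward-model wrapper that returns a scalar tensor for a token sequence; the function and argument names are illustrative, and log-probabilities stand in for LM(v | x).

import torch

def args_greedy(prompt_ids, lm, reward_model, tokenizer, k=10, w=1.5, max_new_tokens=128):
    """Sketch of ARGS-greedy decoding (Algorithm 1); names are illustrative."""
    x = prompt_ids  # LongTensor of shape (1, n): the tokenized context
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = lm(x).logits[0, -1]              # next-token logits given x
        log_probs = torch.log_softmax(logits, dim=-1)
        topk = torch.topk(log_probs, k)               # V^(k): top-k candidate tokens

        best_score, best_token = float("-inf"), None
        for log_p, v in zip(topk.values, topk.indices):
            candidate = torch.cat([x, v.view(1, 1)], dim=-1)   # [x, v]
            with torch.no_grad():
                reward = reward_model(candidate).item()        # r([x, v])
            score = log_p.item() + w * reward                  # LM(v | x) + w * reward
            if score > best_score:
                best_score, best_token = score, v.view(1, 1)
        x = torch.cat([x, best_token], dim=-1)                 # append the selected token
    return tokenizer.decode(x[0])

In this sketch the per-candidate reward calls dominate the decoding cost, which is why k stays small (k = 10 in the quoted setup).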
Open Source Code | Yes | Code is publicly available at: https://github.com/deeplearning-wisc/args.
Open Datasets | Yes | To evaluate the performance of our approach, we employ ARGS on the HH-RLHF (Helpful and Harmless) dataset (Bai et al., 2022), which is the most commonly adopted benchmark for alignment. The dataset consists of 112,000 training samples and 12,500 test samples and is publicly available at https://huggingface.co/datasets/Dahoas/full-hh-rlhf.
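For reference, the quoted Hugging Face dataset can be loaded directly with the datasets library; the snippet below only inspects the published splits and column names rather than assuming a particular schema.

from datasets import load_dataset

# Load the HH-RLHF preference data referenced above.
dataset = load_dataset("Dahoas/full-hh-rlhf")

print(dataset)                        # lists the available splits and their sizes
print(dataset["train"].column_names)  # inspect the preference-pair fields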
Dataset Splits | No | The trained reward model attains a final accuracy of 74.58% on the validation set. [...] The dataset consists of 112,000 training samples and 12,500 test samples and is publicly available. [...] For all evaluations of our proposed method on LLaMA-7B, we use w = 1.5 and k = 10 based on the optimal average reward performance on the validation set.
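The quoted passage selects w = 1.5 and k = 10 by average reward on a validation split, but the split itself is not specified. A hypothetical sketch of that selection loop is shown below; the grid values, the validation_prompts list, and the args_greedy/reward_model names (reused from the sketch above) are all assumptions, not the authors' code.

# Hypothetical grid search over the ARGS hyperparameters (w, k), scoring each
# setting by the average reward of its generations on held-out validation prompts.
best_setting, best_avg = None, float("-inf")
for w in (1.0, 1.5, 2.0):          # illustrative grid, not the paper's
    for k in (5, 10, 20):
        rewards = []
        for prompt in validation_prompts:   # assumed list of held-out prompts
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            text = args_greedy(ids, lm, reward_model, tokenizer, k=k, w=w)
            out_ids = tokenizer(text, return_tensors="pt").input_ids
            rewards.append(reward_model(out_ids).item())
        avg = sum(rewards) / len(rewards)
        if avg > best_avg:
            best_setting, best_avg = (w, k), avg
print("Best (w, k) by average validation reward:", best_setting)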
Hardware Specification | Yes | We conduct our experiments on servers equipped with NVIDIA RTX A6000 GPUs (48GB VRAM) and NVIDIA A100 GPUs (80GB VRAM).
Software Dependencies | Yes | All experiments are implemented in Python 3.11.4 using the PyTorch 1.12.1 framework.
Experiment Setup | Yes | Full details on training hyperparameters are included in Appendix A. [...] Table 5: Summary of training hyperparameters for supervised fine-tuning and reward modeling for LLaMA-7B models. [...] Table 6: Summary of training hyperparameters for supervised fine-tuning and reward modeling for OPT-family models. [...] Table 7: Summary of training hyperparameters for proximal policy optimization (PPO). [...] Table 8: Summary of training hyperparameters for Direct Preference Optimization (DPO).