ARGS: Alignment as Reward-Guided Search

Authors: Maxim Khanov, Jirayu Burapacheep, Yixuan Li

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents empirical experiments to evaluate the effectiveness of our proposed method. In particular, we aim to show that ARGS can effectively guide the outputs of the neural language model in alignment with human preference, such as helpfulness and harmlessness. All of our experiments are based on open-sourced language models and datasets.
Researcher Affiliation | Academia | Maxim Khanov^1*, Jirayu Burapacheep^2*, Yixuan Li^1; ^1 University of Wisconsin-Madison, ^2 Stanford University; mkhanov@wisc.edu, jirayu@stanford.edu, sharonli@cs.wisc.edu
Pseudocode | Yes |
Algorithm 1: ARGS-greedy
Input: Previous context x with n tokens, number of candidates k, reward coefficient w, desired number of tokens m, base model LM, and reward model r
Output: A generated sequence with m tokens
 1: for t = n to m - 1 do
 2:     V^(k) ← top-k tokens with highest likelihood
 3:     for v ∈ V^(k) do                          ▷ Iterate over top-k candidates
 4:         reward ← r([x, v])                    ▷ Compute a reward of this candidate
 5:         score(v) ← LM(v | x) + w · reward
 6:     end for
 7:     v_selected ← argmax_{v ∈ V^(k)} score(v)  ▷ Select token
 8:     x ← [x, v_selected]
 9: end for
10: return x
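As a concrete reference, below is a minimal PyTorch sketch of the ARGS-greedy loop in Algorithm 1. It assumes a Hugging Face-style causal language model (whose forward pass exposes next-token logits) and a reward-model wrapper that returns a scalar tensor for a token sequence; the function and argument names are illustrative, and log-probabilities stand in for LM(v | x).

import torch

def args_greedy(prompt_ids, lm, reward_model, tokenizer, k=10, w=1.5, max_new_tokens=128):
    """Sketch of ARGS-greedy decoding (Algorithm 1); names are illustrative."""
    x = prompt_ids  # LongTensor of shape (1, n): the tokenized context
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = lm(x).logits[0, -1]              # next-token logits given x
        log_probs = torch.log_softmax(logits, dim=-1)
        topk = torch.topk(log_probs, k)               # V^(k): top-k candidate tokens

        best_score, best_token = float("-inf"), None
        for log_p, v in zip(topk.values, topk.indices):
            candidate = torch.cat([x, v.view(1, 1)], dim=-1)   # [x, v]
            with torch.no_grad():
                reward = reward_model(candidate).item()        # r([x, v])
            score = log_p.item() + w * reward                  # LM(v | x) + w * reward
            if score > best_score:
                best_score, best_token = score, v.view(1, 1)
        x = torch.cat([x, best_token], dim=-1)                 # append the selected token
    return tokenizer.decode(x[0])

In this sketch the per-candidate reward calls dominate the decoding cost, which is why k stays small (k = 10 in the quoted setup).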
Open Source Code | Yes | Code is publicly available at: https://github.com/deeplearning-wisc/args.
Open Datasets | Yes | To evaluate the performance of our approach, we employ ARGS on the HH-RLHF (Helpful and Harmless) dataset (Bai et al., 2022), which is the most commonly adopted benchmark for alignment. The dataset consists of 112,000 training samples and 12,500 test samples and is publicly available at https://huggingface.co/datasets/Dahoas/full-hh-rlhf.
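For reference, the quoted Hugging Face dataset can be loaded directly with the datasets library; the snippet below only inspects the published splits and column names rather than assuming a particular schema.

from datasets import load_dataset

# Load the HH-RLHF preference data referenced above.
dataset = load_dataset("Dahoas/full-hh-rlhf")

print(dataset)                        # lists the available splits and their sizes
print(dataset["train"].column_names)  # inspect the preference-pair fields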
Dataset Splits | No | The trained reward model attains a final accuracy of 74.58% on the validation set. [...] The dataset consists of 112,000 training samples and 12,500 test samples and is publicly available. [...] For all evaluations of our proposed method on LLaMA-7B, we use w = 1.5 and k = 10 based on the optimal average reward performance on the validation set.
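The quoted passage selects w = 1.5 and k = 10 by average reward on a validation split, but the split itself is not specified. A hypothetical sketch of that selection loop is shown below; the grid values, the validation_prompts list, and the args_greedy/reward_model names (reused from the sketch above) are all assumptions, not the authors' code.

# Hypothetical grid search over the ARGS hyperparameters (w, k), scoring each
# setting by the average reward of its generations on held-out validation prompts.
best_setting, best_avg = None, float("-inf")
for w in (1.0, 1.5, 2.0):          # illustrative grid, not the paper's
    for k in (5, 10, 20):
        rewards = []
        for prompt in validation_prompts:   # assumed list of held-out prompts
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            text = args_greedy(ids, lm, reward_model, tokenizer, k=k, w=w)
            out_ids = tokenizer(text, return_tensors="pt").input_ids
            rewards.append(reward_model(out_ids).item())
        avg = sum(rewards) / len(rewards)
        if avg > best_avg:
            best_setting, best_avg = (w, k), avg
print("Best (w, k) by average validation reward:", best_setting)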
Hardware Specification | Yes | We conduct our experiments on servers equipped with NVIDIA RTX A6000 GPUs (48GB VRAM) and NVIDIA A100 GPUs (80GB VRAM).
Software Dependencies | Yes | All experiments are implemented in Python 3.11.4 using the PyTorch 1.12.1 framework.
Experiment Setup | Yes | Full details on training hyperparameters are included in Appendix A. [...] Table 5: Summary of training hyperparameters for supervised fine-tuning and reward modeling for LLaMA-7B models. [...] Table 6: Summary of training hyperparameters for supervised fine-tuning and reward modeling for OPT-family models. [...] Table 7: Summary of training hyperparameters for proximal policy optimization (PPO). [...] Table 8: Summary of training hyperparameters for Direct Preference Optimization (DPO).