Learning Goal-Conditioned Representations for Language Reward Models

Authors: Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose training reward models (RMs) in a contrastive, goal-conditioned fashion by increasing the representation similarity of future states along sampled preferred trajectories and decreasing the similarity along randomly sampled dispreferred trajectories. This objective significantly improves reward model performance, by up to 0.09 AUROC, across challenging benchmarks such as MATH and GSM8k. These findings extend to general alignment as well: on the Helpful-Harmless dataset, we observe a 2.3% increase in accuracy. Beyond improving reward model performance, we show that this way of training RM representations enables improved steerability because it allows us to evaluate the likelihood of an action achieving a particular goal state (e.g., whether a solution is correct or helpful). Leveraging this insight, we find that we can filter up to 55% of generated tokens during majority voting by discarding trajectories likely to end up in an incorrect state, which leads to significant cost savings. We additionally find that these representations can perform fine-grained control by conditioning on desired future goal states. For example, we show that steering a Llama 3 model towards helpful generations with our approach improves helpfulness by 9.6% over a supervised fine-tuned baseline. Similarly, steering the model towards complex generations improves complexity by 21.6% over the baseline. Overall, we find that training RMs in this contrastive, goal-conditioned fashion significantly improves performance and enables model steerability. (Hedged sketches of the contrastive objective and the vote-filtering procedure appear after the table.)
Researcher Affiliation | Industry | Vaskar Nath, Dylan Slack, Jeff Da, Yuntao Ma, Hugh Zhang, Spencer Whitehead, Sean Hendryx. The author block marks equal contribution and equal senior authorship; corresponding author: vaskar.nath@scale.com.
Pseudocode | No | The paper describes its methods verbally and with mathematical equations but does not include explicit pseudocode or algorithm blocks with labels such as "Pseudocode" or "Algorithm".
Open Source Code | Yes | Code available at https://github.com/vaskarnathscale/goal-conditioned-rm
Open Datasets | Yes | For training reward models, we use the OpenMathInstruct-1 dataset [62], a math instruction-tuning dataset with 1.8M problem-solution pairs on the training sets of GSM8k and MATH, generated by the Mixtral-8x7B model [32]... We evaluate on several popular math benchmarks. Since the problems in the preference-ranking dataset come from the training sets of GSM8k and MATH, we consider their respective test splits to be in-distribution (ID). We also evaluate on test sets we consider out-of-distribution (OOD), namely Algebra222 [27], GSM-Hard [23], ASDiv [46], MAWPS [35], and SVAMP [52]... We use the Helpful-Harmless dataset [8] for training and evaluation... The prototype is constructed using generations labeled as helpful, complex, coherent, and correct by annotators in NVIDIA's HelpSteer dataset [67]... Appendix J, Table 15: Asset information and licenses (lists URLs for all datasets).
Dataset Splits | No | The paper mentions training sets and test splits for various datasets (GSM8k, MATH, Helpful-Harmless, HelpSteer), but it does not specify explicit validation splits (e.g., percentages or absolute counts) for its experimental setup, nor does it state that standard validation splits from the cited works were used.
Hardware Specification | Yes | For the reward model training and PPO, we used a single node with 8 GPUs, 88 CPU cores, 80 GiB of GPU memory, and 1 TB of system memory... Natural language alignment training jobs were run on 1 node, 8 GPUs, 88 CPU cores, 1 TB of system memory, and 80 GB of GPU memory.
Software Dependencies | Yes | We implement the reward model training with contrastive loss and the baseline of standard preference ranking using the OpenRLHF library [29] in combination with PyTorch [51]... Our sandbox environment is constructed using Python 3.10 and includes necessary mathematical libraries such as NumPy to support the execution of LM-generated code [26]... We compute an AUROC score using the Python scikit-learn package [53] (a minimal example appears after the table)... We additionally used FlashAttention-2 and DeepSpeed with zero_stage 3 for reward model training, and zero_stage 2 for PPO training [17, 3]... All experiment results presented are statistically significant, with significance tests performed using the SciPy library [65]... calculated via the NumPy package [26].
Experiment Setup | Yes | Reward model training parameters: we train the baseline reward model, the Q-Function reward model, and all reward model ablations for 1 epoch with a batch size of 64. The contrastive lambda hyperparameter of 0.5 was chosen by sweeping over 0.1, 0.5, and 0.9; the 0.5 setting gave the most stable training loss as well as training rewards for preferred vs. dispreferred completions (Appendix A.1). Table 4: hyperparameters for reward model training; Table 5: hyperparameters for PPO training; Table 7: hyperparameters for natural language reward model training. For the steering experiments, we set the maximum number of tokens per beam to 52, the beam search temperature to 0, the temperature to 0.2, the number of samples to 2, and the confidence ratio to 0, i.e., equal weight to the confidence score and the LM loss (Appendix B.2). (The sketches after the table show where the contrastive lambda enters and restate the steering settings.)
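
To make the training objective quoted above concrete, the following is a minimal PyTorch sketch of a contrastive, goal-conditioned reward-model loss, not the authors' implementation. The Bradley-Terry preference-ranking baseline, the idea of pulling per-token states toward future states on preferred trajectories and pushing them away from states on dispreferred trajectories, and the contrastive lambda of 0.5 come from the rows above; the cosine-similarity scoring, the cross-entropy form of the contrastive term, and all tensor shapes and function names are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def preference_ranking_loss(r_chosen, r_rejected):
    """Standard Bradley-Terry loss over scalar rewards for chosen vs. rejected responses."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def goal_contrastive_loss(h_states, h_goal_pos, h_goal_neg, temperature=1.0):
    """Pull per-token hidden states toward a future ("goal") state sampled from the
    same preferred trajectory and push them away from a state sampled from a
    dispreferred trajectory (cosine similarity + cross-entropy; an assumed form).

    h_states:   (B, T, D) hidden states of earlier tokens
    h_goal_pos: (B, D)    future state from the preferred trajectory
    h_goal_neg: (B, D)    state from a randomly sampled dispreferred trajectory
    """
    h_states = F.normalize(h_states, dim=-1)
    pos = F.normalize(h_goal_pos, dim=-1).unsqueeze(1)    # (B, 1, D)
    neg = F.normalize(h_goal_neg, dim=-1).unsqueeze(1)    # (B, 1, D)
    sim_pos = (h_states * pos).sum(-1) / temperature      # (B, T)
    sim_neg = (h_states * neg).sum(-1) / temperature      # (B, T)
    logits = torch.stack([sim_pos, sim_neg], dim=-1)      # (B, T, 2); index 0 is the positive
    targets = torch.zeros(logits.shape[:-1], dtype=torch.long)
    return F.cross_entropy(logits.reshape(-1, 2), targets.reshape(-1))

def total_loss(r_chosen, r_rejected, h_states, h_goal_pos, h_goal_neg, lam=0.5):
    """Combined objective; lam plays the role of the contrastive lambda swept in Appendix A.1."""
    return preference_ranking_loss(r_chosen, r_rejected) \
        + lam * goal_contrastive_loss(h_states, h_goal_pos, h_goal_neg)

# Shape check with random tensors (batch of 4 pairs, 128 tokens, 4096-dim hidden states).
B, T, D = 4, 128, 4096
print(total_loss(torch.randn(B), torch.randn(B),
                 torch.randn(B, T, D), torch.randn(B, D), torch.randn(B, D)))
```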
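The abstract also reports filtering up to 55% of generated tokens during majority voting by abandoning trajectories unlikely to reach a correct goal state. The loop below is a hypothetical sketch of such early discarding: sample_chunk, goal_score, extract_answer, the threshold, and the chunking schedule are placeholders introduced for illustration and are not specified in the rows above.

```python
from collections import Counter

def filtered_majority_vote(problem, sample_chunk, goal_score, extract_answer,
                           n_samples=16, threshold=0.5, max_chunks=16):
    """Early-discard majority voting (assumed interfaces):
      sample_chunk(prefix)  -> (text, finished)  continue one chunk of a sampled solution
      goal_score(prefix)    -> float in [0, 1]   RM-based estimate that this prefix reaches
                                                 a correct goal state
      extract_answer(text)  -> str               parse the final answer from a completion
    """
    answers = []
    for _ in range(n_samples):
        prefix, finished = problem, False
        for _ in range(max_chunks):
            chunk, finished = sample_chunk(prefix)
            prefix += chunk
            if finished:
                answers.append(extract_answer(prefix))
                break
            # Abandon trajectories the goal-conditioned RM judges unlikely to end
            # in a correct state; the unspent generation budget is the cost saving.
            if goal_score(prefix) < threshold:
                break
    # Majority vote over the surviving completions only.
    return Counter(answers).most_common(1)[0][0] if answers else None
```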
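The software-dependencies row states that AUROC is computed with the Python scikit-learn package; with placeholder labels and scores, that computation reduces to:

```python
from sklearn.metrics import roc_auc_score

# Placeholder data: 1 means the sampled solution is correct, 0 means incorrect;
# scores are the scalar rewards the reward model assigns to each solution.
labels = [1, 0, 1, 1, 0]
scores = [2.3, -0.7, 1.1, 0.4, -1.9]
print(roc_auc_score(labels, scores))  # AUROC of the reward score as a correctness ranker
```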
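Finally, the steering settings quoted in the experiment-setup row can be read as a small decoding configuration. The field names below are mine; the values are the ones reported above, and the comment restates the stated meaning of a confidence ratio of 0.

```python
# Field names are illustrative; values are those quoted in the experiment-setup row.
steering_config = {
    "max_tokens_per_beam": 52,
    "beam_search_temperature": 0.0,
    "temperature": 0.2,
    "num_samples": 2,
    "confidence_ratio": 0.0,  # 0 => equal weight to the confidence score and the LM loss
}
```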