Tool-Augmented Reward Modeling

Authors: Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. Furthermore, our approach outperforms Gopher 280B by 7.3% on the TruthfulQA task in zero-shot evaluation. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines across four distinct tasks.
Researcher Affiliation | Collaboration | Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu; Zhejiang University; Baidu Inc. {leili21,zhangningyu}@zju.edu.cn; {chaiyekun,wangshuohuan,sunyu02}@baidu.com
Pseudocode | No | The paper describes the steps of its framework (Thought, Action, Observation, Rationale, Reward) and illustrates a data-creation pipeline, but it does not contain a formal pseudocode block or an algorithm labeled as such.
Open Source Code | Yes | We have made the code, data, and model checkpoints publicly available to facilitate and inspire further research advancements. https://github.com/ernie-research/Tool-Augmented-Reward-Model
Open Datasets | Yes | Our contribution also includes the creation of a comprehensive tool-augmented reward dataset, TARA, which encompasses detailed data on human preferences and intricate tool invocation processes. This dataset will be made publicly available in hopes of facilitating research advancements in the field.
Dataset Splits | No | TARA comprises a total of 13,604 training examples and 1,469 test examples, each consisting of a question, a positive answer, and a negative answer. The paper specifies training and test set sizes, but it does not explicitly state the size or construction of a separate validation split for hyperparameter tuning.
Hardware Specification | Yes | All models are trained in the same environment (8 40G A100 GPUs).
Software Dependencies | No | The paper mentions software components and frameworks such as Vicuna, BERT-Large, LoRA, PEFT, and DeepSpeed-Chat, but it does not provide specific version numbers for these libraries or for other ancillary software dependencies required for reproducibility.
Experiment Setup | Yes | We report the experimental hyper-parameters in Table 9. We list the hyper-parameters of the SFT and PPO phases in Table 10.
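Although the paper provides no formal pseudocode, the five steps it describes (Thought, Action, Observation, Rationale, Reward) suggest a simple control loop. The sketch below is purely illustrative: the function names, the toy `TOOLS` registry, and the scoring rule are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of a Thought -> Action -> Observation -> Rationale -> Reward
# loop, following the steps named in the paper. Every identifier here is an
# illustrative assumption, not code from the Themis release.

TOOLS = {
    # Toy "calculator" standing in for real external tools/APIs.
    "calculator": lambda expr: str(eval(expr)),
}

def tool_augmented_reward(question, answer):
    """Score an answer by consulting an external tool (toy version)."""
    # 1. Thought: decide whether and how to verify the answer.
    thought = f"Verify the claim in {answer!r} with an external tool."
    # 2. Action: choose a tool and its input (hard-coded in this sketch).
    tool_name, tool_input = "calculator", "2 + 2"
    # 3. Observation: execute the tool and record its output.
    observation = TOOLS[tool_name](tool_input)
    # 4. Rationale: combine thought and observation into a justification.
    rationale = f"{thought} Tool '{tool_name}' returned {observation}."
    # 5. Reward: emit a scalar preference score (toy consistency rule).
    reward = 1.0 if observation in answer else -1.0
    return reward, rationale

reward, rationale = tool_augmented_reward("What is 2 + 2?", "The answer is 4.")
```

In the actual framework the reward is produced by a trained model conditioned on the full trajectory; the hard-coded rule above only mirrors the shape of the loop.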
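Because no validation split is specified, anyone reproducing the results would need to carve one out of the 13,604 training examples themselves. A minimal sketch follows; the 10% fraction and the fixed seed are arbitrary choices for illustration, not values taken from the paper.

```python
import random

def make_validation_split(train_examples, val_fraction=0.1, seed=0):
    """Hold out a validation set from the training data.

    The 10% fraction and fixed seed are arbitrary reproducibility
    choices, not values reported in the paper.
    """
    rng = random.Random(seed)
    indices = list(range(len(train_examples)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val = [train_examples[i] for i in indices[:n_val]]
    train = [train_examples[i] for i in indices[n_val:]]
    return train, val

# With TARA's 13,604 training examples this yields a
# 12,244 / 1,360 train/validation split.
examples = [{"question": f"q{i}"} for i in range(13604)]
train, val = make_validation_split(examples)
```

Fixing the seed matters here: without it, each rerun would tune hyperparameters against a different held-out set, undermining comparability.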