Tool-Augmented Reward Modeling
Authors: Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. Furthermore, our approach outperforms Gopher 280B by 7.3% on the TruthfulQA task in zero-shot evaluation. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines across four distinct tasks. |
| Researcher Affiliation | Collaboration | Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu; Zhejiang University; Baidu Inc. {leili21,zhangningyu}@zju.edu.cn; {chaiyekun,wangshuohuan,sunyu02}@baidu.com |
| Pseudocode | No | The paper describes the steps of its framework (Thought, Action, Observation, Rationale, Reward) and illustrates a data creation pipeline, but it does not contain a formal pseudocode block or an algorithm labeled as such; a hedged sketch of such a loop is given after this table. |
| Open Source Code | Yes | We have made the code, data, and model checkpoints publicly available to facilitate and inspire further research advancements (https://github.com/ernie-research/Tool-Augmented-Reward-Model). |
| Open Datasets | Yes | Our contribution also includes the creation of a comprehensive tool-augmented reward dataset, TARA, which encompasses detailed data on human preferences and intricate tool invocation processes. This dataset will be made publicly available in hopes of facilitating research advancements in the field. |
| Dataset Splits | No | TARA comprises a total of 13,604 training examples and 1,469 test examples, each consisting of a question, a positive answer, and a negative answer. The paper specifies training and test set sizes, but does not explicitly state the size or method of a separate validation split for hyperparameter tuning. An illustrative training sketch over this triplet format follows the table. |
| Hardware Specification | Yes | All models are trained in the same environment (8 40G A100 GPUs). |
| Software Dependencies | No | The paper mentions software components and frameworks like Vicuna, BERT-Large, LoRA, PEFT, and DeepSpeed-Chat, but it does not provide specific version numbers for these libraries or other ancillary software dependencies required for reproducibility. |
| Experiment Setup | Yes | We report the experimental hyper-parameters in Table 9. We list the hyper-parameters of the SFT and PPO phases in Table 10. |
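
Since the paper contains no formal pseudocode, the following minimal Python sketch illustrates how the Thought, Action, Observation, Rationale, Reward steps named in the Pseudocode row could fit together. Every name here (`generate_thought`, `generate_rationale`, `score`, the `tools` mapping, the attributes on `thought`) is a hypothetical placeholder for illustration, not the authors' actual implementation or API.

```python
from dataclasses import dataclass, field

@dataclass
class ToolTrace:
    """Accumulated tool-use history for one (question, answer) pair."""
    thoughts: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    observations: list = field(default_factory=list)

def tool_augmented_reward(question: str, answer: str, reward_model, tools: dict,
                          max_steps: int = 3) -> float:
    """Score (question, answer) with a reward model that may invoke external tools."""
    trace = ToolTrace()
    for _ in range(max_steps):
        # Thought: decide whether a tool is needed and, if so, which one.
        thought = reward_model.generate_thought(question, answer, trace)
        trace.thoughts.append(thought)
        if thought.tool_name is None:  # no further tool calls needed
            break
        # Action: invoke the selected external tool (e.g. search, calculator, code runner).
        observation = tools[thought.tool_name](thought.tool_input)
        trace.actions.append((thought.tool_name, thought.tool_input))
        # Observation: feed the tool output back into the trace.
        trace.observations.append(observation)
    # Rationale: summarize the evidence gathered through tool use.
    rationale = reward_model.generate_rationale(question, answer, trace)
    # Reward: produce a scalar preference score conditioned on the full trace.
    return reward_model.score(question, answer, trace, rationale)
```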
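Each TARA instance pairs a question with a preferred and a dispreferred answer, the usual triplet format for pairwise reward-model training. The sketch below shows one common way to train on such triplets with a negative log-sigmoid ranking loss; the loss form and the `reward_model` callable are assumptions made for illustration, not details taken from the paper.

```python
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, questions, pos_answers, neg_answers):
    """Pairwise preference loss over (question, positive, negative) triplets."""
    r_pos = reward_model(questions, pos_answers)  # scalar reward per example, shape (batch,)
    r_neg = reward_model(questions, neg_answers)  # shape (batch,)
    # Encourage the preferred answer to receive the higher scalar reward.
    return -F.logsigmoid(r_pos - r_neg).mean()
```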