Tool-Augmented Reward Modeling

Authors: Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate a noteworthy overall improvement of 17.7% across eight tasks in preference ranking. Furthermore, our approach outperforms Gopher 280B by 7.3% on the TruthfulQA task in zero-shot evaluation. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines across four distinct tasks.
Researcher Affiliation | Collaboration | Lei Li, Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, Hua Wu; Zhejiang University; Baidu Inc. {leili21,zhangningyu}@zju.edu.cn; {chaiyekun,wangshuohuan,sunyu02}@baidu.com
Pseudocode | No | The paper describes the steps of its framework (Thought, Action, Observation, Rationale, Reward) and illustrates a data-creation pipeline, but it does not contain a formal pseudocode block or an algorithm labeled as such.
Open Source Code | Yes | We have made the code, data, and model checkpoints publicly available to facilitate and inspire further research advancements. https://github.com/ernie-research/Tool-Augmented-Reward-Model
Open Datasets | Yes | Our contribution also includes the creation of a comprehensive tool-augmented reward dataset, TARA, which encompasses detailed data on human preferences and intricate tool invocation processes. This dataset will be made publicly available in hopes of facilitating research advancements in the field.
Dataset Splits | No | TARA comprises a total of 13,604 training examples and 1,469 test examples, each consisting of a question, a positive answer, and a negative answer. The paper specifies training and test set sizes, but it does not explicitly state the size or construction of a separate validation split for hyperparameter tuning.
Hardware Specification | Yes | All models are trained in the same environment (8 40G A100 GPUs).
Software Dependencies | No | The paper mentions software components and frameworks such as Vicuna, BERT-Large, LoRA, PEFT, and DeepSpeed-Chat, but it does not provide specific version numbers for these libraries or for other ancillary software dependencies required for reproducibility.
Experiment Setup | Yes | We report the experimental hyper-parameters in Table 9. We list the hyper-parameters of the SFT and PPO phases in Table 10.
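Although the paper provides no formal pseudocode, the five steps it describes (Thought, Action, Observation, Rationale, Reward) suggest a simple control loop. The sketch below is purely illustrative: the function names, the toy `TOOLS` registry, and the scoring rule are hypothetical placeholders, not the authors' implementation.

```python
# Hedged sketch of a Thought -> Action -> Observation -> Rationale -> Reward
# loop, following the steps named in the paper. Every identifier here is an
# illustrative assumption, not code from the Themis release.

TOOLS = {
    # Toy "calculator" standing in for real external tools/APIs.
    "calculator": lambda expr: str(eval(expr)),
}

def tool_augmented_reward(question, answer):
    """Score an answer by consulting an external tool (toy version)."""
    # 1. Thought: decide whether and how to verify the answer.
    thought = f"Verify the claim in {answer!r} with an external tool."
    # 2. Action: choose a tool and its input (hard-coded in this sketch).
    tool_name, tool_input = "calculator", "2 + 2"
    # 3. Observation: execute the tool and record its output.
    observation = TOOLS[tool_name](tool_input)
    # 4. Rationale: combine thought and observation into a justification.
    rationale = f"{thought} Tool '{tool_name}' returned {observation}."
    # 5. Reward: emit a scalar preference score (toy consistency rule).
    reward = 1.0 if observation in answer else -1.0
    return reward, rationale

reward, rationale = tool_augmented_reward("What is 2 + 2?", "The answer is 4.")
```

In the actual framework the reward is produced by a trained model conditioned on the full trajectory; the hard-coded rule above only mirrors the shape of the loop.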
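Because no validation split is specified, anyone reproducing the results would need to carve one out of the 13,604 training examples themselves. A minimal sketch follows; the 10% fraction and the fixed seed are arbitrary choices for illustration, not values taken from the paper.

```python
import random

def make_validation_split(train_examples, val_fraction=0.1, seed=0):
    """Hold out a validation set from the training data.

    The 10% fraction and fixed seed are arbitrary reproducibility
    choices, not values reported in the paper.
    """
    rng = random.Random(seed)
    indices = list(range(len(train_examples)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val = [train_examples[i] for i in indices[:n_val]]
    train = [train_examples[i] for i in indices[n_val:]]
    return train, val

# With TARA's 13,604 training examples this yields a
# 12,244 / 1,360 train/validation split.
examples = [{"question": f"q{i}"} for i in range(13604)]
train, val = make_validation_split(examples)
```

Fixing the seed matters here: without it, each rerun would tune hyperparameters against a different held-out set, undermining comparability.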