Dense Reward for Free in Reinforcement Learning from Human Feedback
Authors: Alex James Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We turn to validating these properties empirically. The aim of our experiments is to demonstrate that using ABC allows for improved RL training that is faster, stabler, and more robust than the default sparse reward. |
| Researcher Affiliation | Academia | Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar (University of Cambridge, Cambridge, UK). |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code for implementing our methods and experiments is publicly available at https://github.com/XanderJC/attention-based-credit. |
| Open Datasets | Yes | We use the popular IMDb review database (Maas et al., 2011) containing 50,000 examples of movie reviews that have been classified as either positive or negative sentiment. ... Summarisation. We consider the aim of producing informative summaries of Reddit posts contained in the TL;DR dataset (Stiennon et al., 2020) (adapted from (Völske et al., 2017)). ... Anthropic helpfulness/harmlessness preference dataset (Bai et al., 2022). Appendix C.1 lists the dataset URL: https://huggingface.co/datasets/imdb (a loading sketch follows the table). |
| Dataset Splits | Yes | Out of those examples, 92,900 are used for training and the rest for validation and testing. |
| Hardware Specification | Yes | All experiments were run on a machine with an AMD Epyc Milan 7713 CPU, 120GB RAM, and using a single NVIDIA A6000 Ada Generation GPU accelerator with 48GB VRAM. |
| Software Dependencies | No | Our experiments are implemented on top of the TRL (von Werra et al., 2020) library, making a small adjustment to the PPOTrainer class to allow it to receive a trajectory of per-token rewards instead of a single scalar episodic reward. While the paper names the TRL library, it does not provide specific version numbers for TRL or for other critical software components such as Python, PyTorch, or CUDA. (A per-token reward sketch follows the table.) |
| Experiment Setup | Yes | Hyperparameters used for PPO using the TRL implementation are given in Tables 2, 3, and 4. Note that across methods they are held constant in order to provide a more direct comparison. |
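
For readers verifying the Open Datasets row, a minimal sketch of pulling the IMDb data from the Hugging Face Hub is shown below. It assumes only that the `datasets` library is installed; the split layout in the comments reflects the standard Hub dataset card, not any split chosen by the paper.

```python
# Minimal sketch: load the IMDb review dataset cited above from the
# Hugging Face Hub (https://huggingface.co/datasets/imdb).
from datasets import load_dataset

imdb = load_dataset("imdb")            # labelled train/test splits of movie reviews
print(imdb)                            # inspect split names and sizes
print(imdb["train"][0]["text"][:200])  # first 200 characters of one review
print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive sentiment
```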
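The Software Dependencies row describes a small adjustment to TRL's PPOTrainer so that it accepts a trajectory of per-token rewards rather than one scalar per episode. The sketch below is not the authors' implementation; it only illustrates one plausible way to turn a reward model's scalar score into such a trajectory by redistributing it with attention weights. `reward_model` and `tokenizer` are placeholder names for a Hugging Face sequence-classification reward model (with a single output logit) and its tokenizer.

```python
# Illustrative only: spread a scalar reward-model score over the response
# tokens using attention weights, yielding the kind of per-token reward
# trajectory a modified PPOTrainer could consume.
import torch

def per_token_rewards(reward_model, tokenizer, text, response_len):
    inputs = tokenizer(text, return_tensors="pt")
    out = reward_model(**inputs, output_attentions=True)
    score = out.logits.squeeze()                    # scalar episodic reward
    # Average attention over layers and heads, then take the weight the final
    # position places on every token in the sequence.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]   # (seq_len,)
    weights = attn[-response_len:]                  # keep response tokens only
    weights = weights / weights.sum()               # renormalise to sum to 1
    return score * weights                          # one reward per response token

# An unmodified TRL PPOTrainer expects one scalar per sample, e.g.
#   ppo_trainer.step(query_tensors, response_tensors, [r.sum() for r in rewards])
# whereas the adjustment described above would let the per-token tensors be
# passed in directly.
```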
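The Experiment Setup row notes that PPO hyperparameters come from the paper's Tables 2, 3, and 4 and are held constant across methods. One way to express that with TRL's PPOConfig is sketched below; the numeric values are placeholders for illustration only (they are not the paper's settings), and the parameter names follow the older PPOTrainer-style API that the paper builds on.

```python
# Placeholder values only; the paper's actual hyperparameters are listed in
# its Tables 2-4 and are shared by every method being compared.
from trl import PPOConfig

shared_config = PPOConfig(
    learning_rate=1e-5,    # placeholder
    batch_size=64,         # placeholder
    mini_batch_size=16,    # placeholder
    ppo_epochs=4,          # placeholder
    init_kl_coef=0.2,      # placeholder KL penalty coefficient
)
# The same `shared_config` would then be handed to the trainer for each method,
# so that only the reward signal differs between runs.
```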