Dense Reward for Free in Reinforcement Learning from Human Feedback
Authors: Alex James Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We turn to validating these properties empirically. The aim of our experiments is to demonstrate that using ABC allows for improved RL training that is faster, stabler, and more robust than the default sparse reward. |
| Researcher Affiliation | Academia | Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar (University of Cambridge, Cambridge, UK). |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code for implementing our methods and experiments is publicly available at https://github.com/XanderJC/attention-based-credit. |
| Open Datasets | Yes | We use the popular IMDb review database (Maas et al., 2011) containing 50,000 examples of movie reviews that have been classified as either positive or negative sentiment. ... Summarisation. We consider the aim of producing informative summaries of Reddit posts contained in the TL;DR dataset (Stiennon et al., 2020) (adapted from (Völske et al., 2017)). ... Anthropic helpfulness/harmlessness preference dataset (Bai et al., 2022). Appendix C.1 lists the dataset URL: https://huggingface.co/datasets/imdb (a loading sketch follows the table). |
| Dataset Splits | Yes | Out of those examples, 92,900 are used for training and the rest for validation and testing. |
| Hardware Specification | Yes | All experiments were run on a machine with an AMD Epyc Milan 7713 CPU, 120GB RAM, and using a single NVIDIA A6000 Ada Generation GPU accelerator with 48GB VRAM. |
| Software Dependencies | No | Our experiments are implemented on top of the TRL (von Werra et al., 2020) library, making a small adjustment to the PPOTrainer class to allow it to receive a trajectory of per-token rewards instead of a single scalar episodic reward. While the paper names the TRL library, it does not provide specific version numbers for TRL or for other critical software components such as Python, PyTorch, or CUDA. (A per-token reward sketch follows the table.) |
| Experiment Setup | Yes | Hyperparameters used for PPO using the TRL implementation are given in Tables 2, 3, and 4. Note that across methods they are held constant in order to provide a more direct comparison. |
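
For readers verifying the Open Datasets row, a minimal sketch of pulling the IMDb data from the Hugging Face Hub is shown below. It assumes only that the `datasets` library is installed; the split layout in the comments reflects the standard Hub dataset card, not any split chosen by the paper.

```python
# Minimal sketch: load the IMDb review dataset cited above from the
# Hugging Face Hub (https://huggingface.co/datasets/imdb).
from datasets import load_dataset

imdb = load_dataset("imdb")            # labelled train/test splits of movie reviews
print(imdb)                            # inspect split names and sizes
print(imdb["train"][0]["text"][:200])  # first 200 characters of one review
print(imdb["train"][0]["label"])       # 0 = negative, 1 = positive sentiment
```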
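The Software Dependencies row describes a small adjustment to TRL's PPOTrainer so that it accepts a trajectory of per-token rewards rather than one scalar per episode. The sketch below is not the authors' implementation; it only illustrates one plausible way to turn a reward model's scalar score into such a trajectory by redistributing it with attention weights. `reward_model` and `tokenizer` are placeholder names for a Hugging Face sequence-classification reward model (with a single output logit) and its tokenizer.

```python
# Illustrative only: spread a scalar reward-model score over the response
# tokens using attention weights, yielding the kind of per-token reward
# trajectory a modified PPOTrainer could consume.
import torch

def per_token_rewards(reward_model, tokenizer, text, response_len):
    inputs = tokenizer(text, return_tensors="pt")
    out = reward_model(**inputs, output_attentions=True)
    score = out.logits.squeeze()                    # scalar episodic reward
    # Average attention over layers and heads, then take the weight the final
    # position places on every token in the sequence.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]   # (seq_len,)
    weights = attn[-response_len:]                  # keep response tokens only
    weights = weights / weights.sum()               # renormalise to sum to 1
    return score * weights                          # one reward per response token

# An unmodified TRL PPOTrainer expects one scalar per sample, e.g.
#   ppo_trainer.step(query_tensors, response_tensors, [r.sum() for r in rewards])
# whereas the adjustment described above would let the per-token tensors be
# passed in directly.
```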
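The Experiment Setup row notes that PPO hyperparameters come from the paper's Tables 2, 3, and 4 and are held constant across methods. One way to express that with TRL's PPOConfig is sketched below; the numeric values are placeholders for illustration only (they are not the paper's settings), and the parameter names follow the older PPOTrainer-style API that the paper builds on.

```python
# Placeholder values only; the paper's actual hyperparameters are listed in
# its Tables 2-4 and are shared by every method being compared.
from trl import PPOConfig

shared_config = PPOConfig(
    learning_rate=1e-5,    # placeholder
    batch_size=64,         # placeholder
    mini_batch_size=16,    # placeholder
    ppo_epochs=4,          # placeholder
    init_kl_coef=0.2,      # placeholder KL penalty coefficient
)
# The same `shared_config` would then be handed to the trainer for each method,
# so that only the reward signal differs between runs.
```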