Deep Reinforcement Learning from Human Preferences

Authors: Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 3, Experimental Results: "In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016)."
Researcher Affiliation | Industry | Paul F. Christiano, OpenAI, paul@openai.com; Jan Leike, DeepMind, leike@google.com; Tom B. Brown, Google Brain, tombbrown@google.com; Miljan Martic, DeepMind, miljanm@google.com; Dario Amodei, OpenAI, damodei@openai.com
Pseudocode | No | The paper describes its method in prose, but does not include any pseudocode blocks, algorithm figures, or similarly formatted procedural steps.
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide any links to a code repository.
Open Datasets | Yes | "Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012)."
Dataset Splits | Yes | "A fraction of 1/e of the data is held out to be used as a validation set for each predictor."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions software such as TensorFlow, MuJoCo, the Arcade Learning Environment, and OpenAI Gym, but does not give version numbers for these or for any other software dependencies.
Experiment Setup | Yes | "In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. We use ℓ2 regularization and adjust the regularization coefficient to keep the validation loss between 1.1 and 1.5 times the training loss. In some domains we also apply dropout for regularization."
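
The Research Type and Open Datasets rows quote the paper's statement that experiments run on Atari (via the Arcade Learning Environment) and MuJoCo, both wrapped by OpenAI Gym. Below is a minimal sketch of that interfacing pattern; the specific environment IDs and the four-tuple step API reflect 2017-era Gym and are assumptions, since the paper does not list the exact environments or Gym version used.

```python
import gym

# Environment IDs are illustrative; the paper does not name the exact Gym IDs it used.
atari_env = gym.make("BeamRiderNoFrameskip-v4")  # Arcade Learning Environment task via Gym
mujoco_env = gym.make("Hopper-v1")               # MuJoCo robotics task via Gym (2017-era ID)

# Classic Gym API (pre-gymnasium): reset() returns an observation,
# step() returns (observation, reward, done, info).
obs = atari_env.reset()
for _ in range(10):
    action = atari_env.action_space.sample()
    obs, reward, done, info = atari_env.step(action)
    # Note: in the paper's setup the policy is trained against a learned reward
    # predictor rather than this environment `reward`.
    if done:
        obs = atari_env.reset()
```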
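The Dataset Splits row quotes the 1/e hold-out used as a validation set for each reward predictor. A minimal sketch of such a split follows, assuming the preference data are stored as a list of comparison records; the record format, the NumPy dependency, and the function name are assumptions, not details from the paper.

```python
import numpy as np

def split_holdout(comparisons, seed=0):
    """Hold out a 1/e fraction of preference comparisons as a validation set.

    `comparisons` is assumed to be a sequence of comparison records
    (e.g. (segment_a, segment_b, label) tuples); the paper does not
    specify the storage format.
    """
    rng = np.random.default_rng(seed)
    n = len(comparisons)
    n_val = int(round(n / np.e))  # ~36.8% held out, per the 1/e fraction stated in the paper
    idx = rng.permutation(n)
    val = [comparisons[i] for i in idx[:n_val]]
    train = [comparisons[i] for i in idx[n_val:]]
    return train, val
```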
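The Experiment Setup row notes that the ℓ2 regularization coefficient is adjusted so the validation loss stays between 1.1 and 1.5 times the training loss. The paper does not give the adjustment rule, so the multiplicative feedback update below (including the 1.03 factor and the name `adjust_l2_coefficient`) is purely an illustrative assumption.

```python
def adjust_l2_coefficient(l2_coef, train_loss, val_loss,
                          lower=1.1, upper=1.5, factor=1.03):
    """One possible rule for keeping val_loss / train_loss inside [lower, upper].

    The target band (1.1-1.5x) comes from the paper; the multiplicative
    factor and update scheme are assumptions for illustration only.
    """
    ratio = val_loss / train_loss
    if ratio > upper:
        l2_coef *= factor   # validation loss too high relative to training: regularize more
    elif ratio < lower:
        l2_coef /= factor   # gap smaller than the target band: relax regularization
    return l2_coef
```

Called, for example, once per reward-predictor training epoch, a rule like this nudges the loss ratio toward the stated band; any similar feedback scheme would be consistent with the paper's description.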