Deep Reinforcement Learning from Human Preferences

Authors: Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 3, Experimental Results: "In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016)."
Researcher Affiliation | Industry | Paul F. Christiano, OpenAI, paul@openai.com; Jan Leike, DeepMind, leike@google.com; Tom B. Brown, Google Brain, tombbrown@google.com; Miljan Martic, DeepMind, miljanm@google.com; Dario Amodei, OpenAI, damodei@openai.com
Pseudocode | No | The paper describes its method in prose, but does not include any pseudocode blocks, algorithm figures, or similarly formatted procedural steps.
Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide any links to a code repository.
Open Datasets | Yes | "Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012)."
Dataset Splits | Yes | "A fraction of 1/e of the data is held out to be used as a validation set for each predictor."
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions software such as TensorFlow, MuJoCo, the Arcade Learning Environment, and OpenAI Gym, but does not give version numbers for these or for any other software dependencies.
Experiment Setup | Yes | "In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. We use ℓ2 regularization and adjust the regularization coefficient to keep the validation loss between 1.1 and 1.5 times the training loss. In some domains we also apply dropout for regularization."
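
The Research Type and Open Datasets rows quote the paper's statement that experiments run on Atari (via the Arcade Learning Environment) and MuJoCo, both wrapped by OpenAI Gym. Below is a minimal sketch of that interfacing pattern; the specific environment IDs and the four-tuple step API reflect 2017-era Gym and are assumptions, since the paper does not list the exact environments or Gym version used.

```python
import gym

# Environment IDs are illustrative; the paper does not name the exact Gym IDs it used.
atari_env = gym.make("BeamRiderNoFrameskip-v4")  # Arcade Learning Environment task via Gym
mujoco_env = gym.make("Hopper-v1")               # MuJoCo robotics task via Gym (2017-era ID)

# Classic Gym API (pre-gymnasium): reset() returns an observation,
# step() returns (observation, reward, done, info).
obs = atari_env.reset()
for _ in range(10):
    action = atari_env.action_space.sample()
    obs, reward, done, info = atari_env.step(action)
    # Note: in the paper's setup the policy is trained against a learned reward
    # predictor rather than this environment `reward`.
    if done:
        obs = atari_env.reset()
```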
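The Dataset Splits row quotes the 1/e hold-out used as a validation set for each reward predictor. A minimal sketch of such a split follows, assuming the preference data are stored as a list of comparison records; the record format, the NumPy dependency, and the function name are assumptions, not details from the paper.

```python
import numpy as np

def split_holdout(comparisons, seed=0):
    """Hold out a 1/e fraction of preference comparisons as a validation set.

    `comparisons` is assumed to be a sequence of comparison records
    (e.g. (segment_a, segment_b, label) tuples); the paper does not
    specify the storage format.
    """
    rng = np.random.default_rng(seed)
    n = len(comparisons)
    n_val = int(round(n / np.e))  # ~36.8% held out, per the 1/e fraction stated in the paper
    idx = rng.permutation(n)
    val = [comparisons[i] for i in idx[:n_val]]
    train = [comparisons[i] for i in idx[n_val:]]
    return train, val
```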
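The Experiment Setup row notes that the ℓ2 regularization coefficient is adjusted so the validation loss stays between 1.1 and 1.5 times the training loss. The paper does not give the adjustment rule, so the multiplicative feedback update below (including the 1.03 factor and the name `adjust_l2_coefficient`) is purely an illustrative assumption.

```python
def adjust_l2_coefficient(l2_coef, train_loss, val_loss,
                          lower=1.1, upper=1.5, factor=1.03):
    """One possible rule for keeping val_loss / train_loss inside [lower, upper].

    The target band (1.1-1.5x) comes from the paper; the multiplicative
    factor and update scheme are assumptions for illustration only.
    """
    ratio = val_loss / train_loss
    if ratio > upper:
        l2_coef *= factor   # validation loss too high relative to training: regularize more
    elif ratio < lower:
        l2_coef /= factor   # gap smaller than the target band: relax regularization
    return l2_coef
```

Called, for example, once per reward-predictor training epoch, a rule like this nudges the loss ratio toward the stated band; any similar feedback scheme would be consistent with the paper's description.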