Deep Reinforcement Learning from Human Preferences
Authors: Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3 (Experimental Results): In our first set of experiments, we attempt to solve a range of benchmark tasks for deep RL without observing the true reward. We implemented our algorithm in TensorFlow (Abadi et al., 2016). We interface with MuJoCo (Todorov et al., 2012) and the Arcade Learning Environment (Bellemare et al., 2013) through the OpenAI Gym (Brockman et al., 2016). |
| Researcher Affiliation | Industry | Paul F Christiano (OpenAI) paul@openai.com, Jan Leike (DeepMind) leike@google.com, Tom B Brown (Google Brain) tombbrown@google.com, Miljan Martic (DeepMind) miljanm@google.com, Dario Amodei (OpenAI) damodei@openai.com |
| Pseudocode | No | The paper describes its method in prose, but does not include any pseudocode blocks, algorithm figures, or similarly formatted procedural steps. |
| Open Source Code | No | The paper does not explicitly state that the source code for the described methodology is publicly available, nor does it provide any links to a code repository. |
| Open Datasets | Yes | Our experiments take place in two domains: Atari games in the Arcade Learning Environment (Bellemare et al., 2013), and robotics tasks in the physics simulator MuJoCo (Todorov et al., 2012). |
| Dataset Splits | Yes | A fraction of 1/e of the data is held out to be used as a validation set for each predictor (see the holdout sketch after this table). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions software such as TensorFlow, MuJoCo, the Arcade Learning Environment, and OpenAI Gym, but it does not specify version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | In this paper, we use advantage actor-critic (A2C; Mnih et al., 2016) to play Atari games, and trust region policy optimization (TRPO; Schulman et al., 2015) to perform simulated robotics tasks. In each case, we used parameter settings which have been found to work well for traditional RL tasks. The only hyperparameter which we adjusted was the entropy bonus for TRPO. We use ℓ2 regularization and adjust the regularization coefficient to keep the validation loss between 1.1 and 1.5 times the training loss. In some domains we also apply dropout for regularization (see the regularization sketch after this table). |
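
The 1/e validation holdout quoted in the Dataset Splits row amounts to a simple random split of the labelled comparisons. The sketch below is a minimal illustration of that split; the function and variable names are assumptions and are not taken from the authors' code.

```python
import numpy as np

def split_comparisons(comparisons, rng=None):
    """Hold out roughly a 1/e fraction of labelled comparisons as a
    validation set for one reward predictor, as quoted in the
    Dataset Splits row. Names here are illustrative assumptions."""
    rng = rng or np.random.default_rng(0)
    holdout_frac = 1.0 / np.e  # about 36.8% of the data is held out
    held_out = rng.random(len(comparisons)) < holdout_frac
    train = [c for c, h in zip(comparisons, held_out) if not h]
    val = [c for c, h in zip(comparisons, held_out) if h]
    return train, val
```

Since the paper holds out a separate validation set for each predictor, a split like this would be drawn independently per predictor in the ensemble.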
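
The Experiment Setup row describes tuning the ℓ2 coefficient so that the validation loss stays between 1.1 and 1.5 times the training loss. A minimal sketch of one way such a rule could be implemented is shown below; the multiplicative adjustment factor is an assumption, since the paper does not state the exact update rule.

```python
def adjust_l2_coefficient(l2_coef, train_loss, val_loss,
                          lower=1.1, upper=1.5, factor=1.1):
    """Nudge the l2 penalty so validation loss stays within
    [1.1, 1.5] times the training loss, the target band quoted in
    the Experiment Setup row. The update factor of 1.1 is an
    assumed choice, not taken from the paper."""
    ratio = val_loss / max(train_loss, 1e-8)
    if ratio > upper:
        l2_coef *= factor   # validation loss too high: regularize more
    elif ratio < lower:
        l2_coef /= factor   # validation loss too low: regularize less
    return l2_coef
```

A rule of this shape would be applied periodically during reward-predictor training, using the held-out 1/e fraction described above to compute the validation loss.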