Deep Reinforcement Learning That Matters
Authors: Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger
AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We perform a set of experiments designed to provide insight into the questions posed. |
| Researcher Affiliation | Collaboration | (1) McGill University, Montreal, Canada; (2) Microsoft Maluuba, Montreal, Canada |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Specific details can be found in the supplemental and code can be found at: https://git.io/vFHnf |
| Open Datasets | Yes | We use the Hopper-v1 and HalfCheetah-v1 MuJoCo (Todorov, Erez, and Tassa 2012) environments from OpenAI Gym (Brockman et al. 2016). (A minimal environment-loading sketch follows the table.) |
| Dataset Splits | No | The paper specifies training on "2M samples (i.e. 2M timesteps in the environment)" and discusses evaluating final performance, but it does not provide explicit training/validation/test dataset splits in terms of percentages or counts for a static dataset, which is common in supervised learning. In this RL context, data is generated through interaction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, memory). |
| Software Dependencies | No | The paper mentions using the OpenAI Baselines implementations of several algorithms, as well as TensorFlow and Theano for certain implementations, but it does not specify version numbers for these software dependencies or for the programming languages and libraries used. |
| Experiment Setup | Yes | For DDPG we use a network structure of (64, 64, ReLU) for both actor and critic. For TRPO and PPO, we use (64, 64, tanh) for the policy. For ACKTR, we use (64, 64, tanh) for the actor and (64, 64, ELU) for the critic. We investigate three multilayer perceptron (MLP) architectures commonly seen in the literature: (64, 64), (100, 50, 25), and (400, 300). Furthermore, we vary the activation functions of both the value and policy networks across tanh, ReLU, and Leaky ReLU activations. (An illustrative sketch of these architectures follows the table.) |
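
The environments named in the Open Datasets row are standard OpenAI Gym benchmarks rather than static datasets. As a minimal, purely illustrative sketch (assuming an older gym release that still registers the -v1 MuJoCo environments, together with mujoco-py), they can be instantiated like this:

```python
import gym

# Minimal sketch, assuming an older gym release (with mujoco-py and a MuJoCo
# license, as required at the time) that still registers the -v1 environments.
env = gym.make("Hopper-v1")  # or "HalfCheetah-v1"

obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()          # random actions, purely illustrative
    obs, reward, done, info = env.step(action)  # old 4-tuple step API
    if done:
        obs = env.reset()
env.close()
```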
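
The Experiment Setup row describes the policy and value networks as plain multilayer perceptrons whose hidden-layer sizes and activations are varied. The sketch below is a hypothetical PyTorch reconstruction of those architectures, not the paper's code (the reported implementations build on OpenAI Baselines in TensorFlow/Theano); the observation and action dimensions are assumed for illustration.

```python
import torch.nn as nn

def mlp(sizes, activation, output_activation=nn.Identity):
    """Feed-forward network, e.g. sizes=(obs_dim, 64, 64, act_dim)."""
    layers = []
    for i in range(len(sizes) - 1):
        act = activation if i < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)

# Hidden-layer configurations and activations varied in the paper's experiments.
hidden_configs = [(64, 64), (100, 50, 25), (400, 300)]
activations = [nn.Tanh, nn.ReLU, nn.LeakyReLU]

# Dimensions below are assumed for illustration (roughly Hopper-v1 sizes).
obs_dim, act_dim = 11, 3

policy = mlp((obs_dim, 64, 64, act_dim), nn.Tanh)           # TRPO/PPO policy: (64, 64, tanh)
ddpg_actor = mlp((obs_dim, 64, 64, act_dim), nn.ReLU)       # DDPG actor: (64, 64, ReLU)
ddpg_critic = mlp((obs_dim + act_dim, 64, 64, 1), nn.ReLU)  # DDPG critic: (64, 64, ReLU)
acktr_critic = mlp((obs_dim, 64, 64, 1), nn.ELU)            # ACKTR critic: (64, 64, ELU)
```

Swapping entries of `hidden_configs` and `activations` into the `mlp` calls reproduces the kind of architecture/activation sweep the paper reports, under the stated assumptions.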