Deep Reinforcement Learning That Matters

Authors: Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We perform a set of experiments designed to provide insight into the questions posed.
Researcher Affiliation | Collaboration | 1 McGill University, Montreal, Canada; 2 Microsoft Maluuba, Montreal, Canada
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Specific details can be found in the supplemental and code can be found at: https://git.io/vFHnf
Open Datasets | Yes | We use the Hopper-v1 and Half Cheetah-v1 MuJoCo (Todorov, Erez, and Tassa 2012) environments from OpenAI Gym (Brockman et al. 2016).
Dataset Splits | No | The paper specifies training on "2M samples (i.e. 2M timesteps in the environment)" and discusses evaluating final performance, but it does not provide explicit training/validation/test splits (as percentages or counts of a static dataset), as is common in supervised learning; in this RL setting, training data is generated through interaction with the environment.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions using the OpenAI Baselines implementations of several algorithms, as well as TensorFlow and Theano for certain implementations, but it does not specify exact version numbers for these dependencies or for any programming languages/libraries used.
Experiment Setup | Yes | For DDPG we use a network structure of (64, 64, ReLU) for both actor and critic. For TRPO and PPO, we use (64, 64, tanh) for the policy. For ACKTR, we use (64, 64, tanh) for the actor and (64, 64, ELU) for the critic. We investigate three multilayer perceptron (MLP) architectures commonly seen in the literature: (64, 64), (100, 50, 25), and (400, 300). Furthermore, we vary the activation functions of both the value and policy networks across tanh, ReLU, and Leaky ReLU activations.
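
For illustration, the following is a minimal sketch (not the authors' code) of how the two benchmark tasks named in the Open Datasets row might be loaded, and of how training data is generated by interacting with the environment for the reported 2M timesteps rather than drawn from a static train/validation/test split. It assumes an older release of the gym package in which the MuJoCo tasks still carry the -v1 suffix (registered as Hopper-v1 and HalfCheetah-v1) and uses a random policy as a stand-in for the learned one.

import gym

# "2M samples (i.e. 2M timesteps in the environment)" per task.
TOTAL_TIMESTEPS = 2_000_000

for env_id in ("Hopper-v1", "HalfCheetah-v1"):
    env = gym.make(env_id)
    obs = env.reset()
    steps = 0
    while steps < TOTAL_TIMESTEPS:
        # Placeholder action; a real run would query the learned policy here.
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        steps += 1
        if done:
            # Episodes end on failure or time limit; start a new rollout.
            obs = env.reset()
    env.close()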
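
Similarly, the network architectures quoted in the Experiment Setup row can be written out as a small sketch. The paper's experiments rely on the OpenAI Baselines implementations (TensorFlow/Theano); the PyTorch code below is only an assumed illustration of the stated hidden-layer sizes and activations, with input/output dimensions chosen to roughly match Hopper-v1 (11 observations, 3 actions).

import torch.nn as nn

# Activations varied in the paper (ELU appears only in the ACKTR critic).
ACTIVATIONS = {"tanh": nn.Tanh, "relu": nn.ReLU, "leaky_relu": nn.LeakyReLU, "elu": nn.ELU}

def mlp(in_dim, out_dim, hidden=(64, 64), activation="tanh"):
    """Fully connected network with the hidden sizes studied in the paper:
    (64, 64), (100, 50, 25), or (400, 300)."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), ACTIVATIONS[activation]()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# Per-algorithm defaults quoted above (dimensions are illustrative).
trpo_policy  = mlp(11, 3, hidden=(64, 64), activation="tanh")  # TRPO/PPO policy
ddpg_actor   = mlp(11, 3, hidden=(64, 64), activation="relu")  # DDPG actor and critic use ReLU
acktr_critic = mlp(11, 1, hidden=(64, 64), activation="elu")   # ACKTR critic uses ELU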