Deep Reinforcement Learning That Matters

Authors: Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We perform a set of experiments designed to provide insight into the questions posed.
Researcher Affiliation | Collaboration | 1 McGill University, Montreal, Canada; 2 Microsoft Maluuba, Montreal, Canada
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | Specific details can be found in the supplemental and code can be found at: https://git.io/vFHnf
Open Datasets | Yes | We use the Hopper-v1 and Half Cheetah-v1 MuJoCo (Todorov, Erez, and Tassa 2012) environments from OpenAI Gym (Brockman et al. 2016).
Dataset Splits | No | The paper specifies training on "2M samples (i.e. 2M timesteps in the environment)" and discusses evaluating final performance, but it does not provide explicit training/validation/test splits (as percentages or counts of a static dataset), as is common in supervised learning; in this RL setting, training data is generated through interaction with the environment.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, memory).
Software Dependencies | No | The paper mentions using the OpenAI Baselines implementations of several algorithms, as well as TensorFlow and Theano for certain implementations, but it does not specify exact version numbers for these dependencies or for any programming languages/libraries used.
Experiment Setup | Yes | For DDPG we use a network structure of (64, 64, ReLU) for both actor and critic. For TRPO and PPO, we use (64, 64, tanh) for the policy. For ACKTR, we use (64, 64, tanh) for the actor and (64, 64, ELU) for the critic. We investigate three multilayer perceptron (MLP) architectures commonly seen in the literature: (64, 64), (100, 50, 25), and (400, 300). Furthermore, we vary the activation functions of both the value and policy networks across tanh, ReLU, and Leaky ReLU activations.
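
For illustration, the following is a minimal sketch (not the authors' code) of how the two benchmark tasks named in the Open Datasets row might be loaded, and of how training data is generated by interacting with the environment for the reported 2M timesteps rather than drawn from a static train/validation/test split. It assumes an older release of the gym package in which the MuJoCo tasks still carry the -v1 suffix (registered as Hopper-v1 and HalfCheetah-v1) and uses a random policy as a stand-in for the learned one.

import gym

# "2M samples (i.e. 2M timesteps in the environment)" per task.
TOTAL_TIMESTEPS = 2_000_000

for env_id in ("Hopper-v1", "HalfCheetah-v1"):
    env = gym.make(env_id)
    obs = env.reset()
    steps = 0
    while steps < TOTAL_TIMESTEPS:
        # Placeholder action; a real run would query the learned policy here.
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)
        steps += 1
        if done:
            # Episodes end on failure or time limit; start a new rollout.
            obs = env.reset()
    env.close()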
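
Similarly, the network architectures quoted in the Experiment Setup row can be written out as a small sketch. The paper's experiments rely on the OpenAI Baselines implementations (TensorFlow/Theano); the PyTorch code below is only an assumed illustration of the stated hidden-layer sizes and activations, with input/output dimensions chosen to roughly match Hopper-v1 (11 observations, 3 actions).

import torch.nn as nn

# Activations varied in the paper (ELU appears only in the ACKTR critic).
ACTIVATIONS = {"tanh": nn.Tanh, "relu": nn.ReLU, "leaky_relu": nn.LeakyReLU, "elu": nn.ELU}

def mlp(in_dim, out_dim, hidden=(64, 64), activation="tanh"):
    """Fully connected network with the hidden sizes studied in the paper:
    (64, 64), (100, 50, 25), or (400, 300)."""
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), ACTIVATIONS[activation]()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

# Per-algorithm defaults quoted above (dimensions are illustrative).
trpo_policy  = mlp(11, 3, hidden=(64, 64), activation="tanh")  # TRPO/PPO policy
ddpg_actor   = mlp(11, 3, hidden=(64, 64), activation="relu")  # DDPG actor and critic use ReLU
acktr_critic = mlp(11, 1, hidden=(64, 64), activation="elu")   # ACKTR critic uses ELU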