EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
Authors: Justin Fu, John Co-Reyes, Sergey Levine
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The goal of our experimental evaluation is to compare the EX2 method to both a naïve exploration strategy and to recently proposed exploration schemes for deep reinforcement learning based on explicit density modeling. We present results on both low-dimensional benchmark tasks used in prior work, and on more complex vision-based tasks |
| Researcher Affiliation | Academia | Justin Fu, John D. Co-Reyes, Sergey Levine; University of California, Berkeley; {justinfu,jcoreyes,svlevine}@eecs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 EX2 for batch policy optimization |
| Open Source Code | Yes | Our code and additional supplementary material including videos will be available at https://sites.google.com/view/ex2exploration. |
| Open Datasets | Yes | Our experiments include three low-dimensional tasks intended to assess whether EX2 can successfully perform implicit density estimation and compute exploration bonuses, and four high-dimensional image-based tasks of varying difficulty intended to evaluate whether implicit density estimation provides improvement in domains where generative modeling is difficult. The first low-dimensional task is a continuous 2D maze with a sparse reward function that only provides a reward when the agent is within a small radius of the goal. The other two low-dimensional tasks are benchmark tasks from the OpenAI Gym benchmark suite, SparseHalfCheetah and SwimmerGather, which provide for a comparison against prior work on generative exploration bonuses in the presence of sparse rewards. For the vision-based tasks, we include three Atari games, as well as a much more difficult ego-centric navigation task based on vizDoom (DoomMyWayHome+). |
| Dataset Splits | No | The paper describes using a replay buffer and current trajectories for training the discriminator, but does not provide explicit training, validation, and test dataset splits or percentages for the overall experimental setup. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'TRPO (Schulman et al., 2015)' for policy optimization but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or library versions). |
| Experiment Setup | No | The paper mentions 'beta is a hyperparameter that can be tuned to the magnitude of the task reward' but does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations for the experimental setup in the main text. |
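
The last two rows touch on the one quantitative detail the paper does state at a high level: the task reward is augmented with a novelty bonus derived from the exemplar discriminator's implicit density estimate, scaled by the hyperparameter beta. The plain-NumPy sketch below illustrates that general recipe; the function names, the clipping constant, and the default beta value are assumptions made for illustration, not values taken from the paper or its released code.

```python
import numpy as np

def implicit_density(discriminator_prob):
    """Implicit density estimate from an exemplar discriminator output
    D_s(s) in (0, 1): p_hat(s) is proportional to (1 - D) / D.
    A state the discriminator easily tells apart from the replay buffer
    (D close to 1) gets a low estimated density, i.e. it looks novel."""
    d = np.clip(discriminator_prob, 1e-6, 1 - 1e-6)  # numerical safety (assumed constant)
    return (1.0 - d) / d

def augmented_reward(task_reward, discriminator_prob, beta=1e-3):
    """Task reward plus a novelty bonus of the -log p_hat(s) form,
    scaled by the beta hyperparameter the paper mentions.
    The default beta here is arbitrary, not a value from the paper."""
    p_hat = implicit_density(discriminator_prob)
    return task_reward + beta * (-np.log(p_hat))

# Example: a rarely visited state (D ~ 0.99) receives a larger bonus
# than a frequently visited one (D ~ 0.5).
print(augmented_reward(0.0, 0.99, beta=1e-3))  # noticeable bonus
print(augmented_reward(0.0, 0.50, beta=1e-3))  # bonus ~ 0
```

In the paper's setup the discriminator probability comes from an amortized exemplar model trained against replay-buffer and on-policy states; here it is passed in as a scalar only to keep the example self-contained.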