EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
Authors: Justin Fu, John Co-Reyes, Sergey Levine
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The goal of our experimental evaluation is to compare the EX2 method to both a naïve exploration strategy and to recently proposed exploration schemes for deep reinforcement learning based on explicit density modeling. We present results on both low-dimensional benchmark tasks used in prior work, and on more complex vision-based tasks |
| Researcher Affiliation | Academia | Justin Fu, John D. Co-Reyes, Sergey Levine; University of California, Berkeley; {justinfu,jcoreyes,svlevine}@eecs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 EX2 for batch policy optimization |
| Open Source Code | Yes | Our code and additional supplementary material including videos will be available at https://sites.google.com/view/ex2exploration. |
| Open Datasets | Yes | Our experiments include three low-dimensional tasks intended to assess whether EX2 can successfully perform implicit density estimation and compute exploration bonuses, and four high-dimensional image-based tasks of varying difficulty intended to evaluate whether implicit density estimation provides improvement in domains where generative modeling is difficult. The first low-dimensional task is a continuous 2D maze with a sparse reward function that only provides a reward when the agent is within a small radius of the goal. The other two low-dimensional tasks are benchmark tasks from the OpenAI Gym benchmark suite, SparseHalfCheetah and SwimmerGather, which provide for a comparison against prior work on generative exploration bonuses in the presence of sparse rewards. For the vision-based tasks, we include three Atari games, as well as a much more difficult ego-centric navigation task based on vizDoom (DoomMyWayHome+). |
| Dataset Splits | No | The paper describes using a replay buffer and current trajectories for training the discriminator, but does not provide explicit training, validation, and test dataset splits or percentages for the overall experimental setup. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'TRPO (Schulman et al., 2015)' for policy optimization but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions or library versions). |
| Experiment Setup | No | The paper mentions 'beta is a hyperparameter that can be tuned to the magnitude of the task reward' but does not provide concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training configurations for the experimental setup in the main text. |
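
The last two rows touch on the one quantitative detail the paper does state at a high level: the task reward is augmented with a novelty bonus derived from the exemplar discriminator's implicit density estimate, scaled by the hyperparameter beta. The plain-NumPy sketch below illustrates that general recipe; the function names, the clipping constant, and the default beta value are assumptions made for illustration, not values taken from the paper or its released code.

```python
import numpy as np

def implicit_density(discriminator_prob):
    """Implicit density estimate from an exemplar discriminator output
    D_s(s) in (0, 1): p_hat(s) is proportional to (1 - D) / D.
    A state the discriminator easily tells apart from the replay buffer
    (D close to 1) gets a low estimated density, i.e. it looks novel."""
    d = np.clip(discriminator_prob, 1e-6, 1 - 1e-6)  # numerical safety (assumed constant)
    return (1.0 - d) / d

def augmented_reward(task_reward, discriminator_prob, beta=1e-3):
    """Task reward plus a novelty bonus of the -log p_hat(s) form,
    scaled by the beta hyperparameter the paper mentions.
    The default beta here is arbitrary, not a value from the paper."""
    p_hat = implicit_density(discriminator_prob)
    return task_reward + beta * (-np.log(p_hat))

# Example: a rarely visited state (D ~ 0.99) receives a larger bonus
# than a frequently visited one (D ~ 0.5).
print(augmented_reward(0.0, 0.99, beta=1e-3))  # noticeable bonus
print(augmented_reward(0.0, 0.50, beta=1e-3))  # bonus ~ 0
```

In the paper's setup the discriminator probability comes from an amortized exemplar model trained against replay-buffer and on-policy states; here it is passed in as a scalar only to keep the example self-contained.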