Deep Exploration via Bootstrapped DQN

Authors: Ian Osband, Charles Blundell, Alexander Pritzel, Benjamin Van Roy

NeurIPS 2016

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games." |
| Researcher Affiliation | Collaboration | Ian Osband (1,2), Charles Blundell (2), Alexander Pritzel (2), Benjamin Van Roy (1); 1 Stanford University, 2 Google DeepMind |
| Pseudocode | Yes | "We present a detailed algorithm for our implementation of bootstrapped DQN in Appendix B." |
| Open Source Code | No | The paper discusses the implementation and performance of the algorithm but does not provide a link or an explicit statement about releasing the source code. |
| Open Datasets | Yes | "We now evaluate our algorithm across 49 Atari games on the Arcade Learning Environment [1]." [1] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012. |
| Dataset Splits | No | The paper mentions following the setup of [25] and providing full details in Appendix D, but does not explicitly state specific train/validation/test split percentages or counts in the main text. |
| Hardware Specification | No | The paper notes that "on a single machine our implementation runs roughly 20% slower than DQN" but does not specify any particular hardware components such as CPU or GPU models. |
| Software Dependencies | No | The paper describes the algorithmic implementation details but does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | "We trained fully-connected 2-layer neural networks with 50 rectified linear units (ReLU) in each layer on 50 bootstrapped samples from the data. ... We choose K = 10. ... in light of this empirical observation for Atari, we chose p = 1 to save on minibatch passes." |
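To make the quoted setup concrete, the sketch below illustrates the core idea of bootstrapped DQN as described in the paper: a shared trunk with K = 10 bootstrap heads, a fully-connected 2-layer ReLU trunk, per-episode head sampling, and a bootstrap mask with p = 1. This is a hypothetical, minimal illustration, not the authors' implementation (which was not released); it assumes PyTorch and a classic `reset()`/`step()` environment interface, and all class, function, and variable names are illustrative. The training step, in which each head regresses on its own masked minibatch against its own target head, is omitted for brevity.

```python
import random
import numpy as np
import torch
import torch.nn as nn

class BootstrappedQNetwork(nn.Module):
    """Shared-input Q-network with K independent bootstrap heads.

    Hypothetical sketch: a 2-layer MLP trunk with 50 ReLU units per layer
    and K = 10 linear heads, mirroring the hyperparameters quoted above.
    """
    def __init__(self, obs_dim, n_actions, hidden=50, n_heads=10):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n_actions) for _ in range(n_heads)]
        )

    def forward(self, obs, head):
        # Q-values from a single bootstrap head.
        return self.heads[head](self.trunk(obs))

def run_episode(env, qnet, n_heads=10):
    """Per-episode deep exploration: sample one head, act greedily with it."""
    head = random.randrange(n_heads)  # k ~ Uniform{1..K}
    obs, done = env.reset(), False    # assumes classic gym-style env API
    transitions = []
    while not done:
        with torch.no_grad():
            q = qnet(torch.as_tensor(obs, dtype=torch.float32), head)
        action = int(q.argmax())
        next_obs, reward, done, _ = env.step(action)
        # Bootstrap mask; with p = 1 every head trains on every transition.
        mask = np.random.binomial(1, 1.0, size=n_heads)
        transitions.append((obs, action, reward, next_obs, done, mask))
        obs = next_obs
    return transitions
```

Committing to a single randomly sampled head for a whole episode is what gives the agent temporally extended (deep) exploration, in contrast to per-step dithering such as epsilon-greedy.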