Natural Value Approximators: Learning when to Trust Past Estimates

Authors: Zhongwen Xu, Joseph Modayil, Hado P. van Hasselt, Andre Barreto, David Silver, Tom Schaul

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the LLM response that supports it.

Research Type: Experimental
LLM Response: "We demonstrate that this one change leads to significant improvements on multiple Atari games, when applied to the state-of-the-art A3C algorithm. ... In this section, we integrate our method within A3C (Asynchronous advantage actor-critic [9]) ... We investigate the performance of natural value estimates on a collection of 57 video games from the Atari Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games."

Researcher Affiliation: Industry
LLM Response: "Zhongwen Xu, DeepMind, zhongwen@google.com; Joseph Modayil, DeepMind, modayil@google.com; Hado van Hasselt, DeepMind, hado@google.com; Andre Barreto, DeepMind, andrebarreto@google.com; David Silver, DeepMind, davidsilver@google.com; Tom Schaul, DeepMind, schaul@google.com"

Pseudocode: No
LLM Response: The paper does not contain any pseudocode or clearly labeled algorithm blocks.

Open Source Code: No
LLM Response: The paper does not provide any statements or links indicating that open-source code for their method is available.

Open Datasets: Yes
LLM Response: "We investigate the performance of natural value estimates on a collection of 57 video games from the Atari Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games."

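Since this row hinges on the Arcade Learning Environment being openly available, here is a minimal sketch of loading one of the 57 games through the open-source ale-py package and its Gymnasium bindings. This packaging postdates the 2017 paper, so the interface below is an assumption for illustration, not what the authors ran.

```python
# Minimal sketch: one of the 57 Atari games via the open-source ALE.
# The ALE/Breakout-v5 id and register_envs call belong to the modern
# ale-py / Gymnasium packaging, assumed here; the paper predates it.
import ale_py
import gymnasium as gym

gym.register_envs(ale_py)            # registers the ALE/* environment ids
env = gym.make("ALE/Breakout-v5")    # any of the 57 games works the same way
obs, info = env.reset(seed=0)
for _ in range(100):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```
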
Dataset Splits: No
LLM Response: The paper describes evaluation metrics and conditions ('human starts' and 'no-op starts') but does not specify traditional train/validation/test dataset splits with percentages or counts.

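To make the quoted evaluation conditions concrete: under 'no-op starts', each evaluation episode begins with a random number of no-op actions (up to 30 in the standard Atari protocol) before the agent acts, while 'human starts' resumes episodes from states sampled from human play. A hedged sketch of the no-op condition follows; the helper name and agent-free loop are illustrative, not the authors' code.

```python
# Sketch of the 'no-op starts' evaluation condition (standard Atari
# protocol: up to 30 random no-ops at episode start). Illustrative only.
import random

NOOP_ACTION = 0   # by ALE convention, action 0 is NOOP
MAX_NOOPS = 30    # cap used in the standard evaluation protocol

def noop_start(env, rng=random):
    """Reset env, then apply 1..MAX_NOOPS no-ops before the agent acts."""
    obs, info = env.reset()
    for _ in range(rng.randint(1, MAX_NOOPS)):
        obs, reward, terminated, truncated, info = env.step(NOOP_ACTION)
        if terminated or truncated:   # rare, but restart if no-ops end the episode
            obs, info = env.reset()
    return obs
```
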
Hardware Specification: No
LLM Response: "We train agents for 80 Million agent steps (320 Million Atari game frames) on a single machine with 16 cores." This mentions the number of CPU cores but lacks specific CPU or GPU models, memory details, or other hardware specifications.

Software Dependencies: No
LLM Response: The paper mentions the A3C algorithm and the Adam optimizer but does not provide version numbers for these or any other software components.

Experiment Setup: Yes
LLM Response: "The network architecture is composed of three layers of convolutions, followed by a fully connected layer with output h, which feeds into the two separate heads (π with an additional softmax, and a scalar v...). The updates are done online with a buffer of the past 20 state transitions. The value targets are n-step targets Z_t^n ... We train agents for 80 Million agent steps (320 Million Atari game frames) ... we set k to 50. The networks are trained for 5000 steps using Adam [5] with minibatch size 32."
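
The quoted setup is concrete enough to sketch. Below is a minimal PyTorch rendering of the described network (three convolutions, a fully connected layer producing h, a softmax policy head π, and a scalar value head v) together with the n-step targets Z_t^n. The filter counts, kernel sizes, strides, the 84x84x4 input, the 512-unit hidden layer, and gamma = 0.99 are assumptions filled in from the standard Atari convnet of Mnih et al. (2015), since the quote does not specify them; this is an illustration, not the authors' implementation.

```python
# Sketch of the quoted architecture and n-step targets; layer sizes and
# gamma are assumed (standard Atari convnet values), not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class A3CNet(nn.Module):
    def __init__(self, num_actions: int, hidden: int = 512):
        super().__init__()
        # Three convolutional layers (sizes assumed from Mnih et al. 2015).
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc = nn.Linear(64 * 7 * 7, hidden)        # fully connected layer, output h
        self.policy = nn.Linear(hidden, num_actions)   # pi head (softmax applied below)
        self.value = nn.Linear(hidden, 1)              # scalar v head

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        h = F.relu(self.fc(x.flatten(start_dim=1)))
        return F.softmax(self.policy(h), dim=-1), self.value(h).squeeze(-1)

def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """n-step targets Z_t^n = sum_i gamma^i * r_{t+i} + gamma^n * v(s_{t+n}),
    computed backwards over a short rollout buffer (the quote uses the past
    20 transitions). gamma=0.99 is an assumed, conventional discount."""
    targets, ret = [], bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
        targets.append(ret)
    return list(reversed(targets))
```

For a quick shape check, A3CNet(num_actions=18)(torch.zeros(1, 4, 84, 84)) returns a (1, 18) policy distribution and a (1,) value, matching the two heads described in the quote.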