Natural Value Approximators: Learning when to Trust Past Estimates
Authors: Zhongwen Xu, Joseph Modayil, Hado van Hasselt, Andre Barreto, David Silver, Tom Schaul
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that this one change leads to significant improvements on multiple Atari games, when applied to the state-of-the-art A3C algorithm. ... In this section, we integrate our method within A3C (Asynchronous advantage actor-critic [9]), ... We investigate the performance of natural value estimates on a collection of 57 video games from the Atari Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games. |
| Researcher Affiliation | Industry | Zhongwen Xu, DeepMind, zhongwen@google.com; Joseph Modayil, DeepMind, modayil@google.com; Hado van Hasselt, DeepMind, hado@google.com; Andre Barreto, DeepMind, andrebarreto@google.com; David Silver, DeepMind, davidsilver@google.com; Tom Schaul, DeepMind, schaul@google.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements or links indicating that open-source code for their method is available. |
| Open Datasets | Yes | We investigate the performance of natural value estimates on a collection of 57 video games from the Atari Learning Environment [1], which has become a standard benchmark for Deep RL methods because of the rich diversity of challenges present in the various games. |
| Dataset Splits | No | The paper describes evaluation metrics and conditions ('human starts' and 'no-op starts') but does not specify traditional train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | 'We train agents for 80 Million agent steps (320 Million Atari game frames) on a single machine with 16 cores.' This mentions the number of CPU cores but lacks specific CPU or GPU models, memory details, or other hardware specifications. |
| Software Dependencies | No | The paper mentions 'A3C algorithm' and 'Adam' optimizer but does not provide specific version numbers for these or other software components. |
| Experiment Setup | Yes | The network architecture is composed of three layers of convolutions, followed by a fully connected layer with output h, which feeds into the two separate heads (π with an additional softmax, and a scalar v...). The updates are done online with a buffer of the past 20 state transitions. The value targets are n-step targets Z^n_t... We train agents for 80 Million agent steps (320 Million Atari game frames)... we set k to 50. The networks are trained for 5000 steps using Adam [5] with minibatch size 32. (A hedged code sketch of this setup follows the table.) |
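
The Experiment Setup row is concrete enough to illustrate in code. The sketch below is a minimal, non-authoritative rendering of that description in PyTorch: the three-convolution-plus-fully-connected layout, the two heads (softmax policy π and scalar value v), and the n-step targets Z^n_t come from the quoted setup, while the 84x84x4 input, the filter sizes, the 512-unit hidden layer, and the discount γ=0.99 are assumptions drawn from common Atari practice and are not specified in the excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    """Sketch of the A3C-style network described in the setup row:
    three conv layers -> fully connected layer h -> two heads
    (softmax policy pi, scalar value v). The 84x84x4 input and the
    filter/channel sizes below are assumptions, not quoted values."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)  # assumed sizes
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc = nn.Linear(64 * 7 * 7, 512)   # hidden output h (size assumed)
        self.pi = nn.Linear(512, num_actions)  # policy head (softmax applied below)
        self.v = nn.Linear(512, 1)             # scalar value head

    def forward(self, x: torch.Tensor):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        h = F.relu(self.fc(x.flatten(start_dim=1)))
        return F.softmax(self.pi(h), dim=-1), self.v(h).squeeze(-1)


def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """n-step targets Z^n_t = sum_i gamma^i * r_{t+i} + gamma^n * v(s_{t+n}),
    accumulated backwards over a buffer of recent transitions (the setup
    quotes a buffer of the past 20 transitions). gamma=0.99 is an assumption;
    the excerpt does not quote a discount factor."""
    targets = []
    g = bootstrap_value  # v(s_{t+n}) at the end of the buffer
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return list(reversed(targets))
```

The backward accumulation in `n_step_targets` mirrors how such a transition buffer is typically consumed: the value estimate at the final state bootstraps the targets for every earlier step in the buffer, so each position t receives its own n-step target in a single pass.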