Off-Policy Actor-Critic with Shared Experience Replay
Authors: Simon Schmitt, Matteo Hessel, Karen Simonyan
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari. |
| Researcher Affiliation | Industry | DeepMind. Correspondence to: Simon Schmitt <suschmitt@google.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We provide extensive empirical validation of the proposed solutions on DMLab-30 and further show the benefits of this setup in two training regimes for Atari. ... As a result, we present state-of-the-art data efficiency in Section 5 in terms of median human normalized performance across 57 Atari games (Bellemare et al., 2013), as well as improved learning efficiency on DMLab30 (Beattie et al., 2016) |
| Dataset Splits | No | The paper discusses training regimes and evaluation metrics (e.g., median score across tasks) but does not provide specific training/validation/test dataset splits or their sizes. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | Following (Xu et al., 2018), we use a discount of 0.995. Motivated by recent work by (Kaiser et al., 2019), we use the IMPALA deep network and increased the number of channels 4×. We use 96% replay data per batch. Differently from (Espeholt et al., 2018), we do not use gradient clipping by norm (Pascanu et al., 2012). Updates are computed on mini-batches of 32 (regular) and 128 (replay) trajectories, each corresponding to 19 steps in the environment. |
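
The Experiment Setup row lists the hyperparameters a reimplementation would need to match the reported runs. The sketch below collects them into a single configuration object; the class and field names are hypothetical (the paper provides no source code, per the Open Source Code row), and only the values are taken from the quoted text.

```python
# A minimal configuration sketch, assuming hypothetical names (the authors'
# code is not released); values are those quoted in the Experiment Setup row.
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    discount: float = 0.995             # discount factor, following Xu et al. (2018)
    replay_data_fraction: float = 0.96  # 96% replay data per batch
    online_batch_size: int = 32         # regular (on-policy) trajectories per mini-batch
    replay_batch_size: int = 128        # replayed trajectories per mini-batch
    unroll_length: int = 19             # environment steps per trajectory
    clip_grad_by_norm: bool = False     # unlike Espeholt et al. (2018), no norm clipping


if __name__ == "__main__":
    print(ExperimentConfig())
```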