Safe and Efficient Off-Policy Reinforcement Learning
Authors: Rémi Munos, Thomas Stepleton, Anna Harutyunyan, Marc G. Bellemare
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games. Finally, we illustrate the significance of Retrace(λ) in a deep learning setting by applying it to the suite of Atari 2600 games provided by the Arcade Learning Environment (Bellemare et al., 2013). We compare our algorithms' performance on 60 different Atari 2600 games in the Arcade Learning Environment (Bellemare et al., 2013) using Bellemare et al.'s inter-algorithm score distribution. |
| Researcher Affiliation | Collaboration | Rémi Munos (munos@google.com), Google DeepMind; Thomas Stepleton (stepleton@google.com), Google DeepMind; Anna Harutyunyan (anna.harutyunyan@vub.ac.be), Vrije Universiteit Brussel; Marc G. Bellemare (bellemare@google.com), Google DeepMind |
| Pseudocode | No | The paper defines mathematical operators and theoretical derivations, but does not include any pseudocode or algorithm blocks (a hedged sketch of the Retrace(λ) target computation is given below the table). |
| Open Source Code | No | The paper does not provide any statements about making its code open source or links to code repositories for the methodology described. |
| Open Datasets | Yes | Finally, we illustrate the significance of Retrace(λ) in a deep learning setting by applying it to the suite of Atari 2600 games provided by the Arcade Learning Environment (Bellemare et al., 2013). |
| Dataset Splits | No | To validate our theoretical results, we employ Retrace(λ) in an experience replay (Lin, 1993) setting, where sample transitions are stored within a large but bounded replay memory and subsequently replayed as if they were new experience. The paper does not specify exact dataset split percentages, sample counts, or reference predefined splits for training, validation, or testing. |
| Hardware Specification | No | The paper mentions a 'deep learning setting' but does not specify any hardware details such as GPU or CPU models, processors, or memory used for experiments. |
| Software Dependencies | No | Our agent adapts the DQN architecture of Mnih et al. (2015) to replay short sequences from the memory (details in the appendix) instead of single transitions. The paper mentions using the DQN architecture and Arcade Learning Environment but does not provide specific version numbers for any software or libraries used for reproducibility. |
| Experiment Setup | No | Our agent adapts the DQN architecture of Mnih et al. (2015) to replay short sequences from the memory (details in the appendix) instead of single transitions. The main text refers to details in the appendix but does not itself contain specific hyperparameters or training configurations for the experimental setup (an illustrative sketch of such a replay-based update is given below the table). |
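
The Pseudocode row above notes that the paper specifies Retrace(λ) only through its mathematical operator. As a reading aid, here is a minimal NumPy sketch of that operator for a single finite trajectory: it evaluates the target 𝓡Q(x_t, a_t) = Q(x_t, a_t) + Σ_{s≥t} γ^{s−t} (Π_{i=t+1}^{s} c_i) δ_s, with truncated importance weights c_i = λ min(1, π(a_i|x_i)/μ(a_i|x_i)) and TD errors δ_s = r_s + γ E_π Q(x_{s+1}, ·) − Q(x_s, a_s). The function and argument names are placeholders chosen here, not identifiers from the paper, and the quadratic double loop mirrors the operator's definition rather than an efficient implementation.

```python
import numpy as np

def retrace_targets(q_taken, exp_q_next, rewards, pi_probs, mu_probs,
                    gamma=0.99, lam=1.0):
    """Retrace(lambda) targets for one finite trajectory (illustrative sketch).

    q_taken[t]    = Q(x_t, a_t)                              shape [T]
    exp_q_next[t] = sum_a pi(a | x_{t+1}) * Q(x_{t+1}, a)    shape [T]
    rewards[t]    = r_t                                      shape [T]
    pi_probs[t]   = pi(a_t | x_t), target policy             shape [T]
    mu_probs[t]   = mu(a_t | x_t), behaviour policy          shape [T]
    """
    T = len(rewards)
    # Truncated importance weights c_i = lambda * min(1, pi(a_i|x_i) / mu(a_i|x_i)).
    c = lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))
    # Off-policy TD errors delta_s = r_s + gamma * E_pi Q(x_{s+1}, .) - Q(x_s, a_s).
    delta = np.asarray(rewards) + gamma * np.asarray(exp_q_next) - np.asarray(q_taken)

    targets = np.empty(T)
    for t in range(T):
        correction = 0.0
        trace = 1.0                        # empty product of c's when s = t
        for s in range(t, T):
            if s > t:
                trace *= c[s]              # extend the product c_{t+1} ... c_s
            correction += gamma ** (s - t) * trace * delta[s]
        targets[t] = q_taken[t] + correction
    return targets
```

The same targets can be obtained in O(T) with the backward recursion correction_t = δ_t + γ c_{t+1} correction_{t+1}; the explicit double loop above is kept only because it matches the summation in the operator's definition term by term.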
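
Similarly, the Experiment Setup and Software Dependencies rows quote the paper's statement that the agent adapts the DQN architecture of Mnih et al. (2015) to replay short sequences from memory rather than single transitions, with the concrete configuration deferred to the appendix. The fragment below is only an illustration of how the Retrace targets from the sketch above could be plugged into such a sequence-replay update; the buffer API (`sample_sequence`), the `q_network` and `policy_probs` callables, the sequence length, and all hyperparameter values are assumptions made here, not details taken from the paper.

```python
import numpy as np

def replay_update_sketch(buffer, q_network, policy_probs,
                         seq_len=16, gamma=0.99, lam=1.0):
    """One illustrative update on a short replayed sequence (not the paper's code).

    Assumptions made for this sketch:
      buffer.sample_sequence(T) -> (states [T+1, ...], actions [T], rewards [T],
                                    mu_probs [T]) for a contiguous stored sequence,
                                    where mu_probs are the behaviour probabilities
                                    mu(a_t | x_t) recorded at acting time.
      q_network(states)         -> Q-values of shape [T+1, num_actions].
      policy_probs(q_values)    -> target-policy probabilities pi(. | x), same shape.
    """
    states, actions, rewards, mu_probs = buffer.sample_sequence(seq_len)

    q_all = q_network(states)                               # [T+1, A]
    pi_all = policy_probs(q_all)                            # [T+1, A]

    idx = np.arange(seq_len)
    q_taken = q_all[idx, actions]                           # Q(x_t, a_t)
    pi_taken = pi_all[idx, actions]                         # pi(a_t | x_t)
    exp_q_next = np.sum(pi_all[1:] * q_all[1:], axis=1)     # E_pi Q(x_{t+1}, .)

    # Retrace(lambda) targets from retrace_targets() above, regressed with a
    # squared-error loss that a DQN-style agent would minimise by gradient descent.
    targets = retrace_targets(q_taken, exp_q_next, rewards, pi_taken, mu_probs,
                              gamma=gamma, lam=lam)
    return np.mean((q_taken - targets) ** 2)
```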