DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm
Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Remi Munos, Bernardo Avila Pires, Michal Valko
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When combined with the IMPALA architecture, DoMo-AC showed improvements over the baseline algorithm on the Atari-57 game benchmarks. |
| Researcher Affiliation | Industry | ¹Google DeepMind, ²Omron Sinic X. Correspondence to: Yunhao Tang <robintyh@deepmind.com>. |
| Pseudocode | Yes | Algorithm 1: Doubly multi-step off-policy actor-critic (DoMo-AC) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology is openly available. |
| Open Datasets | Yes | All evaluation environments are the entire suite of Atari games (Bellemare et al., 2013) consisting of 57 levels. |
| Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages, sample counts) for training, validation, or testing. |
| Hardware Specification | No | The paper mentions 'a central GPU learner and N = 512 distributed CPU actors' but does not provide specific models or specifications for the GPU or CPU hardware used. |
| Software Dependencies | No | The paper mentions using 'RMSProp optimizers (Tieleman et al., 2012)' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | The policy/value function networks are both trained by RMSProp optimizers (Tieleman et al., 2012) with learning rate α = 5 × 10⁻⁴ and no momentum. To encourage exploration, the policy loss is augmented by an entropy regularization term with coefficient c_e = 0.01 and a baseline loss with coefficient c_v = 0.5, i.e. the full loss L = L_policy + c_v · L_value + c_e · L_entropy. (See the loss-composition sketch below the table.) |
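
The sketch below illustrates only the loss composition and optimizer settings quoted in the Experiment Setup row; it is not the authors' code. The function and variable names (`full_loss`, `network`, the advantage and target tensors) are hypothetical placeholders, and the DoMo-AC multi-step off-policy corrections themselves are not reproduced here.

```python
# Minimal sketch, assuming a standard actor-critic loss layout:
# L = L_policy + c_v * L_value + c_e * L_entropy, with the coefficients
# and RMSProp settings reported in the table above.
import torch

C_V = 0.5   # baseline (value) loss coefficient from the paper
C_E = 0.01  # entropy regularization coefficient from the paper

def full_loss(chosen_log_probs, advantages, values, value_targets, entropy):
    """Combine policy, value, and entropy terms with the paper's coefficients."""
    policy_loss = -(chosen_log_probs * advantages.detach()).mean()      # policy-gradient surrogate
    value_loss = 0.5 * (values - value_targets.detach()).pow(2).mean()  # squared-error baseline loss
    entropy_loss = -entropy.mean()                                      # minimize negative entropy
    return policy_loss + C_V * value_loss + C_E * entropy_loss

# Hypothetical usage: `network` would be the shared policy/value module.
# RMSProp with lr = 5e-4 and no momentum, matching the reported setup.
# optimizer = torch.optim.RMSprop(network.parameters(), lr=5e-4, momentum=0.0)
```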