DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm

Authors: Yunhao Tang, Tadashi Kozuno, Mark Rowland, Anna Harutyunyan, Remi Munos, Bernardo Avila Pires, Michal Valko

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When combined with the IMPALA architecture, DoMo-AC showed improvements over the baseline algorithm on the Atari-57 game benchmark.
Researcher Affiliation | Industry | Google DeepMind; Omron Sinic X. Correspondence to: Yunhao Tang <robintyh@deepmind.com>.
Pseudocode | Yes | Algorithm 1: Doubly multi-step off-policy actor-critic (DoMo-AC).
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the methodology is openly available.
Open Datasets | Yes | All evaluation environments are drawn from the full suite of Atari games (Bellemare et al., 2013), consisting of 57 levels.
Dataset Splits | No | The paper does not provide specific details on dataset splits (e.g., percentages, sample counts) for training, validation, or testing.
Hardware Specification | No | The paper mentions 'a central GPU learner and N = 512 distributed CPU actors' but does not give specific models or specifications for the GPU or CPU hardware used.
Software Dependencies | No | The paper mentions using 'RMSProp optimizers (Tieleman et al., 2012)' but does not provide version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | The policy and value networks are both trained with RMSProp optimizers (Tieleman et al., 2012) using learning rate α = 5 × 10^-4 and no momentum. To encourage exploration, the policy loss is augmented with an entropy regularization term with coefficient c_e = 0.01 and a baseline (value) loss with coefficient c_v = 0.5, i.e., the full loss is L = L_policy + c_v L_value + c_e L_entropy.
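The Experiment Setup row quotes the paper's optimization hyperparameters. As a minimal illustrative sketch (not the authors' implementation), the code below shows how such a combined loss L = L_policy + c_v L_value + c_e L_entropy and the RMSProp setting could be wired up in JAX with optax. The coefficients c_v = 0.5, c_e = 0.01 and learning rate 5 × 10^-4 come from the quoted setup; the toy network, placeholder advantage and return estimates, and batch fields are assumptions for illustration only, and the paper's actual DoMo-AC policy-gradient and multi-step value estimators are not reproduced here.

```python
# Minimal sketch (assumptions, not the authors' code) of combining
# L = L_policy + c_v * L_value + c_e * L_entropy and training with RMSProp
# (learning rate 5e-4, no momentum). The tiny network, advantage/return
# placeholders, and batch fields are hypothetical.
import jax
import jax.numpy as jnp
import optax

C_V, C_E = 0.5, 0.01          # coefficients reported in the paper
LEARNING_RATE = 5e-4          # RMSProp learning rate, no momentum

def apply_net(params, obs):
    """Toy shared torso producing policy logits and a value estimate."""
    hidden = jnp.tanh(obs @ params["w_h"])
    logits = hidden @ params["w_pi"]
    value = (hidden @ params["w_v"]).squeeze(-1)
    return logits, value

def loss_fn(params, batch):
    logits, value = apply_net(params, batch["obs"])
    log_probs = jax.nn.log_softmax(logits)
    taken = jnp.take_along_axis(
        log_probs, batch["actions"][:, None], axis=1).squeeze(-1)
    # Placeholder advantage; the paper uses multi-step off-policy estimators.
    adv = batch["returns"] - value
    l_policy = -(jax.lax.stop_gradient(adv) * taken).mean()
    l_value = 0.5 * ((batch["returns"] - value) ** 2).mean()
    # Negative entropy, so minimizing the loss encourages exploration.
    l_entropy = (jnp.exp(log_probs) * log_probs).sum(-1).mean()
    return l_policy + C_V * l_value + C_E * l_entropy

optimizer = optax.rmsprop(LEARNING_RATE, momentum=None)

@jax.jit
def update(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state, loss

# Dummy shapes just to show the call pattern.
key = jax.random.PRNGKey(0)
params = {
    "w_h": jax.random.normal(key, (8, 32)) * 0.1,
    "w_pi": jax.random.normal(key, (32, 4)) * 0.1,
    "w_v": jax.random.normal(key, (32, 1)) * 0.1,
}
batch = {
    "obs": jnp.ones((16, 8)),
    "actions": jnp.zeros((16,), dtype=jnp.int32),
    "returns": jnp.ones((16,)),
}
opt_state = optimizer.init(params)
params, opt_state, loss = update(params, opt_state, batch)
```

The point of the sketch is only the weighting of the three loss terms and the momentum-free RMSProp configuration; in the paper these components sit inside an IMPALA-style distributed learner rather than the single-batch update shown here.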