Muesli: Combining Improvements in Policy Optimization

Authors: Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, Hado van Hasselt

ICML 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "The majority of our experiments were performed on 57 classic Atari games from the Arcade Learning Environment... To help understand the different design choices made in Muesli, our experiments on Atari include multiple ablations of our proposed update. Additionally, to evaluate how well our method generalises to different domains, we performed experiments on a suite of continuous control environments... We also conducted experiments in 9x9 Go in self-play..." |
| Researcher Affiliation | Collaboration | "¹DeepMind, London, UK, ²University College London. Correspondence to: Matteo Hessel <mtthss@google.com>, Ivo Danihelka <danihelka@google.com>, Hado van Hasselt <hado@google.com>." |
| Pseudocode | No | The paper describes its methods mathematically and in prose, but provides no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the JAX, Optax, Haiku, and Rlax libraries, but it neither states that the source code for the Muesli method will be released nor links to a code repository. |
| Open Datasets | Yes | "The majority of our experiments were performed on 57 classic Atari games from the Arcade Learning Environment (Bellemare et al., 2013; Machado et al., 2018)... Additionally, to evaluate how well our method generalises to different domains, we performed experiments on a suite of continuous control environments (based on MuJoCo and sourced from the OpenAI Gym (Brockman et al., 2016))." |
| Dataset Splits | No | The paper mentions training agents with uniform experience replay and multi-step returns, but it does not specify explicit training/validation/test splits, nor percentages or sample counts for validation. |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing resources. |
| Software Dependencies | No | The paper mentions the JAX, Optax, Haiku, and Rlax libraries, but it does not give their version numbers, which a reproducible description of software dependencies requires. |
| Experiment Setup | Yes | "We used c = 1 in our experiments, across all domains... We used λ = 1 in all other experiments reported in the paper... All agents in this section are trained using the Sebulba podracer architecture (Hessel et al., 2021)... the model described in Section 4.3 is parametrized by an LSTM (Hochreiter & Schmidhuber, 1997)... Agents are trained using uniform experience replay, and estimate multi-step returns using Retrace (Munos et al., 2016)." Illustrative sketches of the clipped-advantage transform and the Retrace targets appear below the table. |
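
For context on the quoted hyper-parameter c: in Muesli, c is the threshold that clips the advantage estimates inside the CMPO target policy, π_CMPO(a|s) ∝ π_prior(a|s) · exp(clip(adv(s,a), −c, c)). The following is a minimal JAX sketch of that transform for discrete actions; the function name `cmpo_target_policy` and its argument layout are illustrative assumptions, not the authors' code.

```python
import jax
import jax.numpy as jnp

def cmpo_target_policy(prior_logits, advantages, c=1.0):
    """Illustrative sketch of a clipped-MPO (CMPO) target policy.

    pi_cmpo(a|s) is proportional to pi_prior(a|s) * exp(clip(adv(s,a), -c, c)).
    Adding the clipped advantages to the prior logits and re-normalising
    with a softmax implements exactly that product.

    prior_logits: logits of the prior policy,      shape [num_actions]
    advantages:   advantage estimates adv(s, a),   shape [num_actions]
    c:            clipping threshold (the paper reports c = 1 in all domains)
    """
    clipped_adv = jnp.clip(advantages, -c, c)
    return jax.nn.softmax(prior_logits + clipped_adv)
```

Clipping keeps each exponentiated advantage within [e^−c, e^c], which bounds how far the target policy can move away from the prior regardless of the scale of the advantage estimates.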
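
The same row cites Retrace (Munos et al., 2016), with λ = 1, for multi-step return estimation. Below is a minimal JAX sketch of how Retrace(λ) targets can be computed over a length-T trajectory via a backward recursion over temporal-difference errors; the function name `retrace_targets` and its argument conventions are assumptions (the paper's agents build on the Rlax library, whose implementation may differ).

```python
import jax
import jax.numpy as jnp

def retrace_targets(q_t, v_tp1, r_t, discount_t, log_rhos, lambda_=1.0):
    """Illustrative sketch of Retrace(lambda) return targets (Munos et al., 2016).

    q_t:        Q(s_t, a_t) for the actions actually taken,       shape [T]
    v_tp1:      E_{a~pi} Q(s_{t+1}, a) bootstrap values,          shape [T]
    r_t:        rewards,                                          shape [T]
    discount_t: per-step discounts (0 at episode boundaries),     shape [T]
    log_rhos:   log(pi(a_t|s_t) / mu(a_t|s_t)) importance ratios, shape [T]
    """
    # Truncated importance weights c_t = lambda * min(1, pi / mu).
    c_t = lambda_ * jnp.minimum(1.0, jnp.exp(log_rhos))
    # One-step TD errors delta_t = r_t + gamma_t * v_{t+1} - q_t.
    deltas = r_t + discount_t * v_tp1 - q_t
    # The recursion needs c_{t+1}; pad with 0 so the final step just bootstraps.
    c_tp1 = jnp.concatenate([c_t[1:], jnp.zeros(1)])

    def backward_step(acc, xs):
        delta, discount, c = xs
        # Delta_t = delta_t + gamma_t * c_{t+1} * Delta_{t+1}
        acc = delta + discount * c * acc
        return acc, acc

    _, corrections = jax.lax.scan(
        backward_step, jnp.zeros(()), (deltas, discount_t, c_tp1), reverse=True)
    return q_t + corrections  # G_t = Q(s_t, a_t) + Delta_t
```

With λ = 1 and on-policy data (π = μ) the truncated weights c_t are all 1, so the targets reduce to uncorrected multi-step returns; off-policy, the min(1, π/μ) truncation keeps the correction terms from exploding.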