Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

VariBAD: Variational Bayes-Adaptive Deep RL via Meta-Learning

Authors: Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, Shimon Whiteson

JMLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We further evaluate variBAD on MuJoCo tasks widely used in meta-RL and show that it achieves higher online return than existing methods. On the recently proposed Meta-World ML1 benchmark, variBAD achieves state-of-the-art results by a large margin, fully solving two out of the three ML1 tasks for the first time. In Section 6, we perform ablation studies to motivate our design choices, and test how robust variBAD is to the size of the latent space.
Researcher Affiliation Collaboration Luisa Zintgraf EMAIL University of Oxford, Wolfson Building, Parks Road, OX1 3QD Oxford (UK) | Sebastian Schulze EMAIL University of Oxford | Cong Lu EMAIL University of Oxford | Leo Feng EMAIL Mila, Université de Montréal | Maximilian Igl EMAIL University of Oxford; Waymo | Kyriacos Shiarlis EMAIL Waymo | Yarin Gal EMAIL University of Oxford | Katja Hofmann EMAIL Microsoft Research | Shimon Whiteson EMAIL University of Oxford
Pseudocode No The paper describes the methodology using textual descriptions, mathematical equations, and architectural diagrams (e.g., Figure 2). However, it contains no structured pseudocode or explicitly labeled algorithm blocks.
Open Source Code Yes Experimental details, hyperparameters, as well as additional results, can be found in the appendix. The source code is available at https://github.com/lmzintgraf/varibad.
Open Datasets Yes We further evaluate variBAD on the widely used MuJoCo meta-RL benchmarks, and show that variBAD exhibits superior exploratory behaviour at test time compared to existing methods, achieving higher returns during learning. Lastly, on the recently proposed challenging Meta-World ML1 benchmark, variBAD achieves state-of-the-art performance with a large margin compared to existing methods... Table 1 shows the results for variBAD and several baselines on the ML1 benchmark. Results taken from Yu et al. (2019).
Dataset Splits No The paper operates in a meta-learning setting where tasks (MDPs) are sampled from a distribution p(M) for both meta-training and meta-testing. It states: 'At meta-test time, the agent is evaluated based on the average online return it achieves within a fixed amount of time on a new task drawn from p'. However, it does not provide specific dataset split percentages, sample counts, or explicit instructions for partitioning a static dataset into training, validation, and test sets.
Hardware Specification Yes Interestingly, very large parameterisations (1000 for GridWorld and 300 for AntGoal, which was the maximum we could fit into the memory of a single GPU) have a comparatively minute impact... This work was supported by a generous equipment grant and a donated DGX-1 from NVIDIA, and enabled in part by computing resources provided by Compute Canada.
Software Dependencies No We used the PyTorch framework (Paszke et al., 2017) for our experiments. This mentions PyTorch as a framework but does not specify a version number for it or any other key software dependency.
Experiment Setup Yes Appendix E. Hyperparameters We used the PyTorch framework (Paszke et al., 2017) for our experiments. The hyperparameters for GridWorld, MuJoCo CheetahDir, Point Robot and Meta-World ML1-Push can be found in the tables below. For more details, see our reference implementation at https://github.com/lmzintgraf/varibad. We used a different number of seeds per experiment to balance the significance of results against the computation required, due to the inherent randomness/difficulty of different tasks. For the main experiments, we used 20 seeds for GridWorld/Navigation/Meta-World, and 10 seeds per MuJoCo environment. For the ablation studies, we used fewer for MuJoCo (5 instead of 10) and GridWorld (15 instead of 20) due to computational constraints.
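The Dataset Splits row above notes that variBAD's setting has no static train/validation/test partition: tasks (MDPs) are drawn from a distribution p(M) at both meta-training and meta-test time, with meta-test tasks sampled fresh from the same distribution. A minimal sketch of that protocol is below; all names here (`sample_task`, `meta_train_tasks`, `meta_test_tasks`) and the 2D-goal task parameterisation are illustrative assumptions, not from the VariBAD codebase.

```python
import random

def sample_task(rng):
    """Draw one task parameterisation from p(M).

    As an illustrative stand-in, a task is a 2D goal position,
    loosely in the spirit of the paper's goal-reaching environments.
    """
    return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))

def meta_train_tasks(rng, num_tasks=1000):
    """Meta-training: the agent trains across many tasks drawn from p(M)."""
    return [sample_task(rng) for _ in range(num_tasks)]

def meta_test_tasks(rng, num_tasks=10):
    """Meta-testing: online return is measured on NEW tasks from the same p(M),
    so there is no fixed split of a static dataset to report."""
    return [sample_task(rng) for _ in range(num_tasks)]

rng = random.Random(0)
train_tasks = meta_test = None
train_tasks = meta_train_tasks(rng)
test_tasks = meta_test_tasks(rng)
print(len(train_tasks), len(test_tasks))
```

Because both phases sample from the same distribution rather than indexing into a shared dataset, split percentages and sample counts in the usual supervised-learning sense do not apply, which is what the "No" classification above reflects.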