VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

Authors: Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson

ICLR 2020

Reproducibility Variable: Result
    LLM Response

Research Type: Experimental
    "In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods."

Researcher Affiliation: Collaboration
    Luisa Zintgraf (University of Oxford), Kyriacos Shiarlis (Latent Logic), Maximilian Igl (University of Oxford), Sebastian Schulze (University of Oxford), Yarin Gal (University of Oxford), Katja Hofmann (Microsoft Research), Shimon Whiteson (University of Oxford)

Pseudocode: No
    The paper includes architectural diagrams (Figure 2) but does not provide pseudocode or a clearly labeled algorithm block.

Open Source Code: Yes
    "Details and hyperparameters can be found in the appendix, and at https://github.com/lmzintgraf/varibad."

Open Datasets: Yes
    "We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods." ... "Environments taken from https://github.com/katerakelly/oyster."

Dataset Splits: No
    The paper describes training and evaluation in a meta-learning setting with a distribution over tasks and evaluation at "meta-test time", but does not specify traditional dataset splits (e.g., 80/10/10) for a fixed dataset.

Hardware Specification: Yes
    "This work was supported by a generous equipment grant and a donated DGX-1 from NVIDIA."

Software Dependencies: No
    The paper states "We used the PyTorch framework" in Appendices B.2 and C.6, but provides no version numbers for PyTorch or other software dependencies.

Experiment Setup: Yes
    Appendix B.2 lists the hyperparameters for variBAD in the grid-world experiments:
    - RL algorithm: A2C
    - Number of policy steps: 60
    - Number of parallel processes: 16
    - Epsilon: 1e-5
    - Discount factor γ: 0.95
    - Max grad norm: 0.5
    - Value loss coefficient: 0.5
    - Entropy coefficient: 0.01
    - GAE parameter tau: 0.95
    - ELBO loss coefficient: 1.0
    - Policy LR: 0.001
    - VAE LR: 0.001
    - Task embedding size: 5
    - Policy architecture: 2 hidden layers, 32 nodes each, TanH activations
    - Encoder architecture: FC layer with 40 nodes, GRU with hidden size 64, output layer with 10 outputs (µ and σ), ReLU activations
    - Reward decoder architecture: 2 hidden layers, 32 nodes each, 25 output heads, ReLU activations
    - Decoder loss function: binary cross-entropy
    Appendix C.6 also provides hyperparameters for the MuJoCo experiments.
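To make the reported setup easier to scan, the Appendix B.2 hyperparameters can be collected into a single configuration mapping. This is only an illustrative sketch: the key names below are invented for readability and do not correspond to the argument names used in the official variBAD repository.

```python
# Illustrative config sketch of the grid-world hyperparameters reported
# in Appendix B.2 of the variBAD paper. Key names are hypothetical; the
# official repository (https://github.com/lmzintgraf/varibad) uses its
# own argument names.
VARIBAD_GRIDWORLD_CONFIG = {
    "rl_algorithm": "A2C",
    "num_policy_steps": 60,
    "num_parallel_processes": 16,
    "epsilon": 1e-5,
    "discount_factor": 0.95,
    "max_grad_norm": 0.5,
    "value_loss_coef": 0.5,
    "entropy_coef": 0.01,
    "gae_tau": 0.95,
    "elbo_loss_coef": 1.0,
    "policy_lr": 1e-3,
    "vae_lr": 1e-3,
    "task_embedding_size": 5,
    # 2 hidden layers of 32 nodes, TanH activations
    "policy_hidden_layers": [32, 32],
    "encoder": {
        "fc_nodes": 40,
        "gru_hidden_size": 64,
        # 10 outputs: a mean and a std-dev for each of the 5 latent dims
        "latent_outputs": 10,
    },
    # 2 hidden layers of 32 nodes, 25 output heads, ReLU activations
    "reward_decoder_hidden_layers": [32, 32],
    "reward_decoder_output_heads": 25,
    "decoder_loss": "binary_cross_entropy",
}

# Consistency check: the encoder emits a (mu, sigma) pair per latent dim.
assert VARIBAD_GRIDWORLD_CONFIG["encoder"]["latent_outputs"] == \
    2 * VARIBAD_GRIDWORLD_CONFIG["task_embedding_size"]
```

Grouping the values this way also makes the relationship between the task embedding size (5) and the encoder's 10 output units (µ and σ per latent dimension) explicit.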