VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning

Authors: Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson

ICLR 2020

Reproducibility Variable: Result
    LLM Response

Research Type: Experimental
    "In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods."

Researcher Affiliation: Collaboration
    Luisa Zintgraf (University of Oxford), Kyriacos Shiarlis (Latent Logic), Maximilian Igl (University of Oxford), Sebastian Schulze (University of Oxford), Yarin Gal (University of Oxford), Katja Hofmann (Microsoft Research), Shimon Whiteson (University of Oxford)

Pseudocode: No
    The paper includes architectural diagrams (Figure 2) but does not provide pseudocode or a clearly labeled algorithm block.

Open Source Code: Yes
    "Details and hyperparameters can be found in the appendix, and at https://github.com/lmzintgraf/varibad."

Open Datasets: Yes
    "We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods." ... "Environments taken from https://github.com/katerakelly/oyster."

Dataset Splits: No
    The paper describes training and evaluation in a meta-learning setting with a distribution over tasks and evaluation at "meta-test time", but does not specify traditional dataset splits (e.g., 80/10/10) for a fixed dataset.

Hardware Specification: Yes
    "This work was supported by a generous equipment grant and a donated DGX-1 from NVIDIA."

Software Dependencies: No
    The paper states "We used the PyTorch framework" in Appendices B.2 and C.6, but provides no version numbers for PyTorch or other software dependencies.

Experiment Setup: Yes
    Appendix B.2 lists the hyperparameters for variBAD in the grid-world experiments:
    - RL algorithm: A2C
    - Number of policy steps: 60
    - Number of parallel processes: 16
    - Epsilon: 1e-5
    - Discount factor γ: 0.95
    - Max grad norm: 0.5
    - Value loss coefficient: 0.5
    - Entropy coefficient: 0.01
    - GAE parameter tau: 0.95
    - ELBO loss coefficient: 1.0
    - Policy LR: 0.001
    - VAE LR: 0.001
    - Task embedding size: 5
    - Policy architecture: 2 hidden layers, 32 nodes each, TanH activations
    - Encoder architecture: FC layer with 40 nodes, GRU with hidden size 64, output layer with 10 outputs (µ and σ), ReLU activations
    - Reward decoder architecture: 2 hidden layers, 32 nodes each, 25 output heads, ReLU activations
    - Decoder loss function: binary cross-entropy
    Appendix C.6 also provides hyperparameters for the MuJoCo experiments.
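To make the reported setup easier to scan, the Appendix B.2 hyperparameters can be collected into a single configuration mapping. This is only an illustrative sketch: the key names below are invented for readability and do not correspond to the argument names used in the official variBAD repository.

```python
# Illustrative config sketch of the grid-world hyperparameters reported
# in Appendix B.2 of the variBAD paper. Key names are hypothetical; the
# official repository (https://github.com/lmzintgraf/varibad) uses its
# own argument names.
VARIBAD_GRIDWORLD_CONFIG = {
    "rl_algorithm": "A2C",
    "num_policy_steps": 60,
    "num_parallel_processes": 16,
    "epsilon": 1e-5,
    "discount_factor": 0.95,
    "max_grad_norm": 0.5,
    "value_loss_coef": 0.5,
    "entropy_coef": 0.01,
    "gae_tau": 0.95,
    "elbo_loss_coef": 1.0,
    "policy_lr": 1e-3,
    "vae_lr": 1e-3,
    "task_embedding_size": 5,
    # 2 hidden layers of 32 nodes, TanH activations
    "policy_hidden_layers": [32, 32],
    "encoder": {
        "fc_nodes": 40,
        "gru_hidden_size": 64,
        # 10 outputs: a mean and a std-dev for each of the 5 latent dims
        "latent_outputs": 10,
    },
    # 2 hidden layers of 32 nodes, 25 output heads, ReLU activations
    "reward_decoder_hidden_layers": [32, 32],
    "reward_decoder_output_heads": 25,
    "decoder_loss": "binary_cross_entropy",
}

# Consistency check: the encoder emits a (mu, sigma) pair per latent dim.
assert VARIBAD_GRIDWORLD_CONFIG["encoder"]["latent_outputs"] == \
    2 * VARIBAD_GRIDWORLD_CONFIG["task_embedding_size"]
```

Grouping the values this way also makes the relationship between the task embedding size (5) and the encoder's 10 output units (µ and σ per latent dimension) explicit.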