VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning
Authors: Luisa Zintgraf, Kyriacos Shiarlis, Maximilian Igl, Sebastian Schulze, Yarin Gal, Katja Hofmann, Shimon Whiteson
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a grid-world domain, we illustrate how variBAD performs structured online exploration as a function of task uncertainty. We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods. |
| Researcher Affiliation | Collaboration | Luisa Zintgraf (University of Oxford); Kyriacos Shiarlis (Latent Logic); Maximilian Igl (University of Oxford); Sebastian Schulze (University of Oxford); Yarin Gal (University of Oxford); Katja Hofmann (Microsoft Research); Shimon Whiteson (University of Oxford) |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2) but does not provide pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Details and hyperparameters can be found in the appendix, and at https://github.com/lmzintgraf/varibad. |
| Open Datasets | Yes | We further evaluate variBAD on MuJoCo domains widely used in meta-RL and show that it achieves higher online return than existing methods. ... Environments taken from https://github.com/katerakelly/oyster. |
| Dataset Splits | No | The paper describes training and evaluation in a meta-learning setting with a distribution over tasks and 'meta-test time' evaluation, but does not specify traditional dataset splits (e.g., 80/10/10) for a fixed dataset. |
| Hardware Specification | Yes | This work was supported by a generous equipment grant and a donated DGX-1 from NVIDIA. |
| Software Dependencies | No | The paper mentions 'We used the PyTorch framework' in Appendix B.2 and C.6, but does not provide a specific version number for PyTorch or other software dependencies. |
| Experiment Setup | Yes | Appendix B.2: 'Hyperparameters for variBAD are: RL Algorithm A2C, Number of policy steps 60, Number of parallel processes 16, Epsilon 1e-5, Discount factor γ 0.95, Max grad norm 0.5, Value loss coefficient 0.5, Entropy coefficient 0.01, GAE parameter tau 0.95, ELBO loss coefficient 1.0, Policy LR 0.001, VAE LR 0.001, Task embedding size 5, Policy architecture 2 hidden layers, 32 nodes each, TanH activations, Encoder architecture FC layer with 40 nodes, GRU with hidden size 64, output layer with 10 outputs (µ and σ), ReLU activations, Reward decoder architecture 2 hidden layers, 32 nodes each, 25 output heads, ReLU activations, Decoder loss function binary cross entropy'. Appendix C.6 also provides hyperparameters for the MuJoCo experiments. |
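For readability, the flattened grid-world hyperparameter list from Appendix B.2 can be sketched as a configuration dictionary. This is purely illustrative: the key names are hypothetical and do not come from the authors' released code; only the values and descriptions are from the paper.

```python
# Hypothetical config collecting the variBAD grid-world hyperparameters
# reported in Appendix B.2. Key names are illustrative, not the authors'.
varibad_gridworld_config = {
    "rl_algorithm": "A2C",
    "num_policy_steps": 60,
    "num_parallel_processes": 16,
    "epsilon": 1e-5,
    "discount_factor_gamma": 0.95,
    "max_grad_norm": 0.5,
    "value_loss_coef": 0.5,
    "entropy_coef": 0.01,
    "gae_tau": 0.95,
    "elbo_loss_coef": 1.0,
    "policy_lr": 0.001,
    "vae_lr": 0.001,
    "task_embedding_size": 5,
    "policy_arch": "2 hidden layers, 32 nodes each, TanH activations",
    "encoder_arch": (
        "FC layer with 40 nodes, GRU with hidden size 64, "
        "output layer with 10 outputs (mu and sigma), ReLU activations"
    ),
    "reward_decoder_arch": (
        "2 hidden layers, 32 nodes each, 25 output heads, ReLU activations"
    ),
    "decoder_loss": "binary cross entropy",
}
```

A dictionary like this makes it easy to spot which values a re-implementation would need to reproduce the grid-world results; the MuJoCo experiments use the separate hyperparameters listed in Appendix C.6.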