Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture

Authors: Suyoung Lee, Sae-Young Chung

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | LDM significantly outperforms standard meta-RL methods in test returns on the gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution and the test task distribution. From Section 5 (Experiments): We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42].
Researcher Affiliation | Academia | Suyoung Lee, KAIST, suyoung.l@kaist.ac.kr; Sae-Young Chung, KAIST, schung@kaist.ac.kr
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplementary material or as a URL)? [Yes] as a URL in the Appendix.
Open Datasets | No | We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42].
Dataset Splits | Yes | To evaluate the generalization ability of agents in environments unseen during training, we split M into two strictly disjoint training and test sets of MDPs, i.e., M = M_train ∪ M_test and M_train ∩ M_test = ∅. Table 1 caption: Set of training, test, and evaluation tasks for the MuJoCo tasks, k ∈ {0, 1, 2, 3}.
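To make the disjointness concrete, below is a minimal sketch (not the authors' released code) of how a one-dimensional task parameter such as a goal velocity can be partitioned into strictly disjoint training and test sets, i.e., M = M_train ∪ M_test with M_train ∩ M_test = ∅. The held-out intervals are hypothetical; the paper's actual task ranges are given in its Table 1.

```python
# Illustrative sketch only (not the authors' released code): partition a 1-D task
# parameter (e.g., a goal velocity) into strictly disjoint training and test sets.
import numpy as np

def in_any_interval(x, intervals):
    """Return True if x lies in any half-open [lo, hi) interval."""
    return any(lo <= x < hi for lo, hi in intervals)

def split_tasks(candidates, test_intervals):
    """Tasks whose parameter falls in a held-out interval form M_test; the rest form M_train."""
    train = [v for v in candidates if not in_any_interval(v, test_intervals)]
    test = [v for v in candidates if in_any_interval(v, test_intervals)]
    return train, test

# Hypothetical held-out velocity ranges; the paper's actual ranges are in its Table 1.
candidates = np.linspace(0.0, 3.0, 61)
train_tasks, test_tasks = split_tasks(candidates, test_intervals=[(1.0, 1.5), (2.5, 3.0)])
assert not set(train_tasks) & set(test_tasks)  # M_train ∩ M_test = ∅
```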
Hardware Specification | Yes | All our experiments run on a cluster of Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz machines with NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The code is written in PyTorch. Our implementation builds upon the official implementation of variBAD (https://github.com/lmzintgraf/varibad) and PEARL (https://github.com/katerakelly/official-pearl). We use MuJoCo 2.0 as the physics engine for all continuous control tasks.
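A quick sanity check of the reported stack (PyTorch for the model code and MuJoCo 2.0 as the physics engine, accessed via mujoco-py) could look like the hedged sketch below; the printed versions depend on the local install and are not specified by the paper.

```python
# Hypothetical environment check, not taken from the paper's repository.
# The paper reports PyTorch for the model code and MuJoCo 2.0 as the physics engine.
import torch
import mujoco_py  # binds to a local MuJoCo 2.0 installation

print("PyTorch version:", torch.__version__)
print("mujoco-py version:", mujoco_py.__version__)
print("CUDA available:", torch.cuda.is_available())
```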
Experiment Setup | Yes | All baselines are given N = 4 rollout episodes for a fixed task, except for ProMP and E-MAML, which are given N = 20 rollouts; this choice of N follows the reference implementations of the baselines. The time horizon has been carefully set so that the agent cannot visit all states within the first episode but can visit them within two episodes. If a rollout is over, the agent is reset to the origin. We set LDM's p_drop = 0.5 for all MuJoCo tasks. We train n = 14 normal workers and n̂ = 2 mixture workers in parallel unless otherwise stated.
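The stated hyperparameters can be collected into a single configuration; the dictionary below is a hypothetical structure (field names are ours), but the numeric values are the ones quoted above.

```python
# Hypothetical configuration container (field names are illustrative);
# numeric values are those reported in the paper's experiment setup.
experiment_config = {
    "rollout_episodes_per_task": 4,      # N = 4 for most baselines
    "rollout_episodes_promp_emaml": 20,  # ProMP and E-MAML use N = 20
    "ldm_p_drop": 0.5,                   # dropout rate used by LDM on all MuJoCo tasks
    "num_normal_workers": 14,            # n = 14 workers trained on real training tasks
    "num_mixture_workers": 2,            # n̂ = 2 workers trained on imaginary (mixture) tasks
}
```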