Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture
Authors: Suyoung Lee, Sae-Young Chung
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | LDM significantly outperforms standard meta-RL methods in test returns on the gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution and the test task distribution. We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42]. |
| Researcher Affiliation | Academia | Suyoung Lee, KAIST (suyoung.l@kaist.ac.kr); Sae-Young Chung, KAIST (schung@kaist.ac.kr) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplementary material or as a URL)? [Yes] as a URL in Appendix |
| Open Datasets | No | We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42]. |
| Dataset Splits | Yes | To evaluate the generalization ability of agents in environments unseen during training, we split M into two strictly disjoint training and test sets of MDPs, i.e., M = M_train ∪ M_test and M_train ∩ M_test = ∅. Table 1: Set of training, test and evaluation tasks of MuJoCo tasks. k ∈ {0, 1, 2, 3}. |
| Hardware Specification | Yes | All our experiments run on a cluster of Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz machines with NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The code is written in PyTorch. Our implementation builds upon the official implementation of variBAD (https://github.com/lmzintgraf/varibad) and PEARL (https://github.com/katerakelly/official-pearl). We use MuJoCo 2.0 as the physics engine for all continuous control tasks. |
| Experiment Setup | Yes | All baselines are given N = 4 rollout episodes for a fixed task, except for ProMP and E-MAML, which are given N = 20 rollouts. This choice of N follows the reference implementations of the baselines. The time horizon is set so that the agent cannot visit all states within the first episode but can visit them within two episodes. If a rollout is over, the agent is reset to the origin. We set LDM's p_drop = 0.5 for all MuJoCo tasks. We train n = 14 normal workers and n̂ = 2 mixture workers in parallel unless otherwise stated. (An illustrative sketch of this setup follows the table.) |
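As a hedged illustration of the "Dataset Splits" and "Experiment Setup" rows above, the following minimal Python sketch shows one way the strictly disjoint train/test task split and the reported training budget could be encoded. The names `make_task_sets` and `TrainConfig`, and the example task ids, are hypothetical and are not taken from the released LDM code; only the numeric values (N = 4, p_drop = 0.5, n = 14, n̂ = 2) come from the paper excerpts quoted in the table.

```python
from dataclasses import dataclass

# Hypothetical helper: strictly disjoint train/test task sets, mirroring the
# "Dataset Splits" row (M = M_train ∪ M_test and M_train ∩ M_test = ∅).
def make_task_sets(all_tasks, test_tasks):
    """Split the full task set M into disjoint training and test sets."""
    test = set(test_tasks)
    train = set(all_tasks) - test
    assert train.isdisjoint(test), "training and test tasks must not overlap"
    return train, test


# Hypothetical configuration mirroring the "Experiment Setup" row.
@dataclass
class TrainConfig:
    rollouts_per_task: int = 4   # N = 4 (N = 20 for the ProMP and E-MAML baselines)
    p_drop: float = 0.5          # LDM's dropout probability on MuJoCo tasks
    n_normal_workers: int = 14   # n = 14 normal workers trained in parallel
    n_mixture_workers: int = 2   # n̂ = 2 mixture workers trained in parallel


if __name__ == "__main__":
    # Example only: task ids 0..24 with a few held out for testing.
    train, test = make_task_sets(all_tasks=range(25), test_tasks=[3, 7, 11, 19])
    print(f"{len(train)} training tasks, {len(test)} test tasks")
    print(TrainConfig())
```

The assertion makes the "strictly disjoint" requirement explicit, so any accidental overlap between training and test tasks fails loudly rather than silently leaking test tasks into training.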