Improving Generalization in Meta-RL with Imaginary Tasks from Latent Dynamics Mixture

Authors: Suyoung Lee, Sae-Young Chung

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | LDM significantly outperforms standard meta-RL methods in test returns on the gridworld navigation and MuJoCo tasks where we strictly separate the training task distribution and the test task distribution. From Section 5 (Experiments): We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42].
Researcher Affiliation | Academia | Suyoung Lee, KAIST, suyoung.l@kaist.ac.kr; Sae-Young Chung, KAIST, schung@kaist.ac.kr
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplementary material or as a URL)? [Yes] as a URL in the Appendix.
Open Datasets | No | We evaluate LDM and other meta-RL methods on the gridworld example (Figure 1) and three MuJoCo meta-RL tasks [42].
Dataset Splits | Yes | To evaluate the generalization ability of agents in environments unseen during training, we split M into two strictly disjoint training and test sets of MDPs, i.e., M = M_train ∪ M_test and M_train ∩ M_test = ∅. Table 1 caption: Set of training, test, and evaluation tasks for the MuJoCo tasks, k ∈ {0, 1, 2, 3}.
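To make the disjointness concrete, below is a minimal sketch (not the authors' released code) of how a one-dimensional task parameter such as a goal velocity can be partitioned into strictly disjoint training and test sets, i.e., M = M_train ∪ M_test with M_train ∩ M_test = ∅. The held-out intervals are hypothetical; the paper's actual task ranges are given in its Table 1.

```python
# Illustrative sketch only (not the authors' released code): partition a 1-D task
# parameter (e.g., a goal velocity) into strictly disjoint training and test sets.
import numpy as np

def in_any_interval(x, intervals):
    """Return True if x lies in any half-open [lo, hi) interval."""
    return any(lo <= x < hi for lo, hi in intervals)

def split_tasks(candidates, test_intervals):
    """Tasks whose parameter falls in a held-out interval form M_test; the rest form M_train."""
    train = [v for v in candidates if not in_any_interval(v, test_intervals)]
    test = [v for v in candidates if in_any_interval(v, test_intervals)]
    return train, test

# Hypothetical held-out velocity ranges; the paper's actual ranges are in its Table 1.
candidates = np.linspace(0.0, 3.0, 61)
train_tasks, test_tasks = split_tasks(candidates, test_intervals=[(1.0, 1.5), (2.5, 3.0)])
assert not set(train_tasks) & set(test_tasks)  # M_train ∩ M_test = ∅
```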
Hardware Specification | Yes | All our experiments run on a cluster of Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz machines with NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The code is written in PyTorch. Our implementation builds upon the official implementation of variBAD (https://github.com/lmzintgraf/varibad) and PEARL (https://github.com/katerakelly/official-pearl). We use MuJoCo 2.0 as the physics engine for all continuous control tasks.
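A quick sanity check of the reported stack (PyTorch for the model code and MuJoCo 2.0 as the physics engine, accessed via mujoco-py) could look like the hedged sketch below; the printed versions depend on the local install and are not specified by the paper.

```python
# Hypothetical environment check, not taken from the paper's repository.
# The paper reports PyTorch for the model code and MuJoCo 2.0 as the physics engine.
import torch
import mujoco_py  # binds to a local MuJoCo 2.0 installation

print("PyTorch version:", torch.__version__)
print("mujoco-py version:", mujoco_py.__version__)
print("CUDA available:", torch.cuda.is_available())
```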
Experiment Setup | Yes | All baselines are given N = 4 rollout episodes for a fixed task, except for ProMP and E-MAML, which are given N = 20 rollouts; this choice of N follows the reference implementations of the baselines. The time horizon has been carefully set so that the agent cannot visit all states within the first episode but can visit them within two episodes. If a rollout is over, the agent is reset to the origin. We set LDM's p_drop = 0.5 for all MuJoCo tasks. We train n = 14 normal workers and n̂ = 2 mixture workers in parallel unless otherwise stated.
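The stated hyperparameters can be collected into a single configuration; the dictionary below is a hypothetical structure (field names are ours), but the numeric values are the ones quoted above.

```python
# Hypothetical configuration container (field names are illustrative);
# numeric values are those reported in the paper's experiment setup.
experiment_config = {
    "rollout_episodes_per_task": 4,      # N = 4 for most baselines
    "rollout_episodes_promp_emaml": 20,  # ProMP and E-MAML use N = 20
    "ldm_p_drop": 0.5,                   # dropout rate used by LDM on all MuJoCo tasks
    "num_normal_workers": 14,            # n = 14 workers trained on real training tasks
    "num_mixture_workers": 2,            # n̂ = 2 workers trained on imaginary (mixture) tasks
}
```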