Meta-Reinforcement Learning via Exploratory Task Clustering

Authors: Zhendong Chu, Renqin Cai, Hongning Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed method on environments with parametric clusters (e.g., rewards and state dynamics in the MuJoCo suite) and non-parametric clusters (e.g., control skills in the Meta-World suite). The results demonstrate strong advantages of our solution against a set of representative meta-RL methods. From the Experiments section: "In this section, we conduct extensive experiments to study the following research questions."
Researcher Affiliation | Collaboration | Zhendong Chu (1), Renqin Cai (2), Hongning Wang (1); (1) Department of Computer Science, University of Virginia; (2) Meta. Emails: zc9uy@virginia.edu, renqincai@meta.com, hw5x@virginia.edu
Pseudocode | Yes | The detailed pseudo-codes of the meta-train and meta-test phases for MILET are shown in Appendix B.
Open Source Code | No | The paper does not provide a direct link to the source code for MILET or explicitly state that the code for the method is being released. It mentions using "implementations of baselines provided by the original papers," but not its own code.
Open Datasets | Yes | We evaluated MILET on two continuous control tasks with clustered reward functions, simulated by MuJoCo (Todorov, Erez, and Tassa 2012). We also evaluated MILET on a challenging task suite, Meta-World (Yu et al. 2020). (See the environment-setup sketch after the table.)
Dataset Splits | No | For each environment, we create 500 tasks for meta-train and hold out 32 new tasks for meta-test. The paper specifies train and test splits, but it does not mention a separate validation set or split for hyperparameter tuning or model selection. (See the task-sampling sketch after the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions algorithms and frameworks such as Proximal Policy Optimization (PPO), Gated Recurrent Units (GRUs), and Variational Autoencoders (VAEs), as well as the MuJoCo simulator. However, it does not specify any software library names with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We trained MILET via Proximal Policy Optimization (PPO) (Schulman et al. 2017) for 2000 meta-iterations and set the default cluster number C to 4. The batch size is set to 2000 with 10 epochs. We use the Adam optimizer with a learning rate of 3e-4, γ = 0.99, and GAE λ = 0.95. The hidden sizes for all GRUs are set to 256. For exploration, we set a_h = 0.9, b_h = 0.1, s_h = 0.005, a_c = 0.9, b_c = 0.05, s_c = 0.005. The length of an episode is set to 150. More implementation details can be found in Appendix D. (A hyperparameter sketch follows the table.)
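For context on the Open Datasets row above, here is a minimal sketch of how the two environment suites are typically instantiated. The benchmark variant (ML10) and the environment ID (HalfCheetah-v2) are assumptions chosen for illustration; the paper's exact clustered-reward variants are not specified here.

```python
import gym          # MuJoCo continuous control (Todorov, Erez, and Tassa 2012)
import metaworld    # Meta-World manipulation suite (Yu et al. 2020)

# A MuJoCo locomotion environment; the paper's clustered-reward variants
# (e.g., goal parameters grouped into clusters) are built on top of tasks
# like this one (assumption).
cheetah_env = gym.make("HalfCheetah-v2")

# A Meta-World benchmark; ML10 is used here only as an example.
benchmark = metaworld.ML10()
for name, env_cls in benchmark.train_classes.items():
    env = env_cls()
    # Each environment class has a pool of parametric task variations;
    # pick one and bind it to the environment before interacting with it.
    task = next(t for t in benchmark.train_tasks if t.env_name == name)
    env.set_task(task)
    obs = env.reset()
```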
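The 500/32 meta-train/meta-test split quoted in the Dataset Splits row can be mimicked by sampling task parameters with a fixed seed. The goal-velocity parameterization and cluster centers below are purely illustrative and are not the paper's actual task distributions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible

# Illustrative parametric tasks: each task is a target velocity drawn around
# one of C = 4 cluster centers (values are assumptions, not the paper's).
cluster_centers = np.array([0.5, 1.5, 2.5, 3.5])

def sample_tasks(num_tasks):
    """Sample tasks by picking a cluster center and adding small noise."""
    centers = rng.choice(cluster_centers, size=num_tasks)
    return centers + rng.normal(scale=0.1, size=num_tasks)

meta_train_tasks = sample_tasks(500)  # 500 tasks for meta-train
meta_test_tasks = sample_tasks(32)    # 32 held-out tasks for meta-test
```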
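The hyperparameters quoted in the Experiment Setup row translate directly into a training configuration. The dictionary below only restates the reported values; the key names, and any PPO implementation they would be passed to, are assumptions since the authors' code is not released.

```python
# Reported hyperparameters from the Experiment Setup row; key names are
# illustrative and do not come from the authors' (unreleased) code.
ppo_config = {
    "meta_iterations": 2000,
    "batch_size": 2000,
    "ppo_epochs": 10,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "discount_gamma": 0.99,
    "gae_lambda": 0.95,
    "gru_hidden_size": 256,
    "num_clusters": 4,        # default cluster number C
    "episode_length": 150,
    # Exploration schedule parameters as reported in the paper.
    "exploration": {
        "a_h": 0.9, "b_h": 0.1, "s_h": 0.005,
        "a_c": 0.9, "b_c": 0.05, "s_c": 0.005,
    },
}
```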