Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies

Authors: Sungryull Sohn, Hyunjae Woo, Jongwook Choi, Honglak Lee

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on various environments ranging from simple grid-world (Sohn et al., 2018) to StarCraft II (Vinyals et al., 2017). In all cases, our method accurately infers the latent subtask graph structure and adapts more efficiently to unseen tasks than the baselines.
Researcher Affiliation | Collaboration | University of Michigan and Google Brain ({srsohn,hjwoo,jwook}@umich.edu; honglak@google.com)
Pseudocode | Yes | Algorithm 1: Adaptation policy optimization during meta-training
Open Source Code | No | The paper does not explicitly state that the source code for its methodology is made available, nor does it provide a link to a code repository for its implementation.
Open Datasets | Yes | We evaluate our approach in comparison with the following baselines: Random is a policy that executes a random eligible subtask that has not been completed. RL2 is the meta-RL agent in Duan et al. (2016), trained to maximize the return over K episodes. HRL is the hierarchical RL agent in Sohn et al. (2018), trained with the same actor-critic method as our approach during the adaptation phase; its network parameters are reset when the task changes. GRProp+Oracle is the GRProp policy (Sohn et al., 2018) provided with the ground-truth subtask graph as input; this is roughly an upper bound on the performance of MSGI-based approaches. MSGI-Rand (Ours) uses a random policy as the adaptation policy, together with the task inference module. MSGI-Meta (Ours) uses a meta-learned policy (i.e., π_θ^adapt) as the adaptation policy, together with the task inference module. For RL2 and HRL, we use the same network architecture as our MSGI adaptation policy. More details of training and network architecture can be found in Appendix J. The domains on which we evaluate these approaches include two simple grid-world environments, Mining and Playground (Sohn et al., 2018), and a more challenging domain, SC2LE (StarCraft II) (Vinyals et al., 2017).
Dataset Splits | No | For Playground, we follow the setup of Sohn et al. (2018): we train the agent on D1-Train with an adaptation budget of 10 episodes, and test on the unseen graph distribution D1-Eval and on larger graphs D2-D4 (see Appendix C for more details about the tasks in Playground and Mining). A minimal sketch of this adaptation-budget protocol is given after the table.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using "scikit-learn" but does not specify its version or any other software dependencies with their respective version numbers.
Experiment Setup | Yes | We used an actor-critic method with GAE (Schulman et al., 2016), with learning rate η = 0.002, γ = 1, and λ = 0.9. We used the RMSProp optimizer with a smoothing parameter of 0.99 and ε = 1e-5. We trained our MSGI-Meta agent for 8000 trials, updating the agent after every trial. We used the best hyperparameters chosen from the sets specified in Table 4 for all the agents. We also used entropy regularization with an annealed coefficient β_ent: we started from β_ent = 0.05 and linearly decreased it after 1200 trials until it reached β_ent = 0 at 3200 trials. During training, we update the critic network to minimize E[(R_t − V_θ^π(s_t))^2], where R_t is the cumulative reward at time t, with a loss weight of 0.03. We clipped the magnitude of the gradient to be no larger than 1. A hedged sketch of these optimizer settings is given after the table.
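The Dataset Splits row describes adapting for a fixed budget of 10 episodes per task before evaluation. The sketch below shows one way that adaptation-then-test protocol could be wired up. It is a minimal illustration under assumed gym-style interfaces: `run_episode`, `evaluate_task`, `infer_subtask_graph`, and `grprop_policy` are hypothetical names chosen for illustration, not the authors' released code (the paper does not provide an implementation).

```python
"""Sketch of the Playground meta-test protocol quoted above: adapt for a fixed
budget of episodes on a sampled task, infer the subtask graph from the
adaptation trajectories, then evaluate GRProp on the inferred graph.
All interfaces here are hypothetical placeholders."""

NUM_ADAPTATION_EPISODES = 10  # adaptation budget used for Playground in the paper


def run_episode(env, policy):
    """Roll out one episode with an assumed gym-style env; return trajectory and return."""
    trajectory, episode_return = [], 0.0
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, info))  # info assumed to carry completion/eligibility bits
        episode_return += reward
        obs = next_obs
    return trajectory, episode_return


def evaluate_task(env, adaptation_policy, infer_subtask_graph, grprop_policy):
    """Adaptation phase -> subtask-graph inference -> test phase, as described in the table."""
    adaptation_data = [run_episode(env, adaptation_policy)[0]
                       for _ in range(NUM_ADAPTATION_EPISODES)]
    inferred_graph = infer_subtask_graph(adaptation_data)  # inference module over trajectories
    test_policy = lambda obs: grprop_policy(obs, inferred_graph)
    _, test_return = run_episode(env, test_policy)
    return test_return
```

At meta-test time this routine would be averaged over tasks drawn from D1-Eval or D2-D4; only the 10-episode budget and the overall adapt/infer/test structure come from the quoted text.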
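The Experiment Setup row lists concrete optimizer and regularization settings; the sketch below restates them as code. This is a minimal PyTorch sketch under assumptions: the paper does not name its framework, `policy_value_net` and the loss terms are placeholders, and gradient clipping is assumed to be norm clipping. Only the numeric values (learning rate 0.002, RMSProp smoothing 0.99, ε = 1e-5, critic-loss weight 0.03, the entropy annealing schedule, and the clip threshold of 1) come from the quoted text.

```python
"""Hedged sketch of the training settings quoted in the Experiment Setup row."""
import torch

LR = 2e-3             # learning rate eta = 0.002
RMSPROP_ALPHA = 0.99  # RMSProp smoothing parameter
RMSPROP_EPS = 1e-5
CRITIC_WEIGHT = 0.03  # weight on the critic loss E[(R_t - V(s_t))^2]
GRAD_CLIP = 1.0       # gradient magnitude bound (assumed to be a norm clip)


def entropy_coef(trial, start=0.05, anneal_begin=1200, anneal_end=3200):
    """beta_ent = 0.05 until trial 1200, then linearly annealed to 0 at trial 3200."""
    if trial <= anneal_begin:
        return start
    if trial >= anneal_end:
        return 0.0
    return start * (anneal_end - trial) / (anneal_end - anneal_begin)


def make_optimizer(policy_value_net):
    """RMSProp with the smoothing and epsilon values quoted in the table."""
    return torch.optim.RMSprop(policy_value_net.parameters(),
                               lr=LR, alpha=RMSPROP_ALPHA, eps=RMSPROP_EPS)


def update(optimizer, policy_value_net, actor_loss, critic_loss, entropy, trial):
    """One update per trial: actor-critic loss with annealed entropy bonus and gradient clipping."""
    loss = actor_loss + CRITIC_WEIGHT * critic_loss - entropy_coef(trial) * entropy
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy_value_net.parameters(), GRAD_CLIP)
    optimizer.step()
```

Under the quoted setup, `update` would be called once per trial for 8000 trials during meta-training.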