Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning

Authors: Tianmin Shu, Caiming Xiong, Richard Socher

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach on Minecraft games designed to explicitly test the ability to reuse previously learned skills while simultaneously learning new skills. Our experimental results demonstrate that this framework can (i) efficiently learn hierarchical policies and representations for multi-task RL; (ii) learn to utter human instructions to deploy pretrained policies, improve their explainability and reuse skills; and (iii) learn a stochastic temporal grammar via self-supervision to predict future actions.
Researcher Affiliation | Collaboration | Tianmin Shu, University of California, Los Angeles (tianmin.shu@ucla.edu); Caiming Xiong & Richard Socher, Salesforce Research ({cxiong, rsocher}@salesforce.com)
Pseudocode | Yes | Appendix A, "PSEUDO CODE OF OUR ALGORITHMS": Algorithm 1, RUN(k, g); Algorithm 2, Learning global policy and STG at stage k > 0. (A hedged sketch of what RUN(k, g) plausibly does is given below the table.)
Open Source Code | No | The paper provides a link to a video demo ("A video demo is available at https://youtu.be/pOv2YiV-2XI") but does not state that the source code for the methodology is openly available or provide a link to a code repository.
Open Datasets | No | The paper describes a custom Minecraft environment created for the experiments: "Figure 3 (left) shows the two room environment in Minecraft that we created using the Malmo platform (Johnson et al., 2016)." It does not state that this environment or the data generated from it is publicly available, nor does it provide a link or citation for public access to the experimental setup as a dataset.
Dataset Splits | No | The paper describes a "2-phase curriculum learning" scheme and mentions holding out tasks for testing ("For the last task set, we hold out 6 tasks... for testing"), but it does not specify traditional dataset splits (e.g., percentages or counts for training, validation, and test data subsets). The environment is dynamic, and tasks are learned incrementally rather than from a fixed, pre-split dataset.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud computing instances).
Software Dependencies | No | The paper mentions using "RMSProp (Tieleman & Hinton, 2012)" as an optimizer and the "Malmo platform (Johnson et al., 2016)" for the environment, but it does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the implementation of their model.
Experiment Setup | Yes | We train the network with RMSProp (Tieleman & Hinton, 2012) with a learning rate of 0.0001. We set the batch size to be 36 and clip the gradient to a unit norm. For all tasks, the discounted coefficient is γ = 0.95. For the 2-phase curriculum learning, we set the average reward threshold to be 0.9 (average rewards are estimated from the most recent 200 episodes of each task). For all experiments in this paper, we use M = 500. To encourage random exploration, we apply ϵ-greedy to the decision sampling for the global policy (i.e., only at the top level k at each stage k > 0), where ϵ gradually decreases from 0.1 to 0.
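Based only on the algorithm names quoted in the Pseudocode row and the paper's description of uttering human instructions to deploy pretrained policies, the following is a minimal Python sketch of how the recursive RUN(k, g) procedure could work. The policy interface (switch, instruct, act), the env object, and the step budget are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of Algorithm 1, RUN(k, g): a level-k policy either issues a
# human-readable instruction g' to the pretrained level-(k-1) policy (skill
# reuse) or takes a primitive action itself. All interfaces are assumptions.

def run(k, goal, env, policies, max_steps=100):
    """Execute the level-k policy until `goal` is reached or the budget runs out."""
    for _ in range(max_steps):
        state = env.observe()
        if k == 0:
            # Base level: act directly with the flat policy for this goal.
            env.step(policies[0].act(state, goal))
        else:
            # Switch policy decides: delegate to the lower level, or act directly.
            if policies[k].switch(state, goal):
                # Instruction policy "utters" a sub-goal for the pretrained
                # level-(k-1) policy, reusing a previously learned skill.
                sub_goal = policies[k].instruct(state, goal)
                run(k - 1, sub_goal, env, policies, max_steps)
            else:
                # Augmented flat policy takes a primitive action at this level.
                env.step(policies[k].act(state, goal))
        if env.goal_reached(goal) or env.terminal():
            break
```

The design choice this sketch reflects is the one the abstract describes: the global policy at stage k explains its behavior through the instructions it issues to lower-level, previously trained policies, rather than acting as an opaque flat controller.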
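The hyperparameters in the Experiment Setup row map directly onto optimizer and exploration settings. Below is a minimal PyTorch-style sketch, assuming a model with a hypothetical compute_loss method and a linear ε-annealing schedule; only the numeric values come from the paper.

```python
import torch

# Numeric values are quoted from the paper; everything else is illustrative.
LEARNING_RATE = 1e-4           # RMSProp learning rate
BATCH_SIZE = 36                # batch size
GAMMA = 0.95                   # discount factor ("discounted coefficient")
GRAD_CLIP_NORM = 1.0           # "clip the gradient to a unit norm"
EPS_START, EPS_END = 0.1, 0.0  # epsilon-greedy range for the top-level policy

def epsilon_at(step, total_steps):
    """Anneal epsilon from 0.1 to 0 (the linear schedule shape is assumed)."""
    frac = min(step / float(total_steps), 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def training_step(model, batch, optimizer):
    """One gradient update with unit-norm gradient clipping."""
    loss = model.compute_loss(batch, gamma=GAMMA)  # hypothetical loss method
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)
    optimizer.step()
    return loss.item()

# Example wiring (model definition omitted):
# optimizer = torch.optim.RMSprop(model.parameters(), lr=LEARNING_RATE)
```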