Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning

Authors: Tianmin Shu, Caiming Xiong, Richard Socher

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach on Minecraft games designed to explicitly test the ability to reuse previously learned skills while simultaneously learning new skills. Our experimental results demonstrate that this framework can (i) efficiently learn hierarchical policies and representations for multi-task RL; (ii) learn to utter human instructions to deploy pretrained policies, improve their explainability and reuse skills; and (iii) learn a stochastic temporal grammar via self-supervision to predict future actions.
Researcher Affiliation | Collaboration | Tianmin Shu, University of California, Los Angeles (tianmin.shu@ucla.edu); Caiming Xiong & Richard Socher, Salesforce Research ({cxiong, rsocher}@salesforce.com)
Pseudocode | Yes | Appendix A, "PSEUDO CODE OF OUR ALGORITHMS": Algorithm 1, RUN(k, g); Algorithm 2, Learning global policy and STG at stage k > 0. (A hedged sketch of what RUN(k, g) plausibly does is given below the table.)
Open Source Code | No | The paper provides a link to a video demo ("A video demo is available at https://youtu.be/pOv2YiV-2XI") but does not state that the source code for the methodology is openly available or provide a link to a code repository.
Open Datasets | No | The paper describes a custom Minecraft environment created for the experiments: "Figure 3 (left) shows the two room environment in Minecraft that we created using the Malmo platform (Johnson et al., 2016)." It does not state that this environment or the data generated from it is publicly available, nor does it provide a link or citation for public access to the experimental setup as a dataset.
Dataset Splits | No | The paper describes a "2-phase curriculum learning" scheme and mentions holding out tasks for testing ("For the last task set, we hold out 6 tasks... for testing"), but it does not specify traditional dataset splits (e.g., percentages or counts for training, validation, and test data subsets). The environment is dynamic, and tasks are learned incrementally rather than from a fixed, pre-split dataset.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud computing instances).
Software Dependencies | No | The paper mentions using "RMSProp (Tieleman & Hinton, 2012)" as an optimizer and the "Malmo platform (Johnson et al., 2016)" for the environment, but it does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the implementation of their model.
Experiment Setup | Yes | We train the network with RMSProp (Tieleman & Hinton, 2012) with a learning rate of 0.0001. We set the batch size to be 36 and clip the gradient to a unit norm. For all tasks, the discounted coefficient is γ = 0.95. For the 2-phase curriculum learning, we set the average reward threshold to be 0.9 (average rewards are estimated from the most recent 200 episodes of each task). For all experiments in this paper, we use M = 500. To encourage random exploration, we apply ϵ-greedy to the decision sampling for the global policy (i.e., only at the top level k at each stage k > 0), where ϵ gradually decreases from 0.1 to 0.
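Based only on the algorithm names quoted in the Pseudocode row and the paper's description of uttering human instructions to deploy pretrained policies, the following is a minimal Python sketch of how the recursive RUN(k, g) procedure could work. The policy interface (switch, instruct, act), the env object, and the step budget are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of Algorithm 1, RUN(k, g): a level-k policy either issues a
# human-readable instruction g' to the pretrained level-(k-1) policy (skill
# reuse) or takes a primitive action itself. All interfaces are assumptions.

def run(k, goal, env, policies, max_steps=100):
    """Execute the level-k policy until `goal` is reached or the budget runs out."""
    for _ in range(max_steps):
        state = env.observe()
        if k == 0:
            # Base level: act directly with the flat policy for this goal.
            env.step(policies[0].act(state, goal))
        else:
            # Switch policy decides: delegate to the lower level, or act directly.
            if policies[k].switch(state, goal):
                # Instruction policy "utters" a sub-goal for the pretrained
                # level-(k-1) policy, reusing a previously learned skill.
                sub_goal = policies[k].instruct(state, goal)
                run(k - 1, sub_goal, env, policies, max_steps)
            else:
                # Augmented flat policy takes a primitive action at this level.
                env.step(policies[k].act(state, goal))
        if env.goal_reached(goal) or env.terminal():
            break
```

The design choice this sketch reflects is the one the abstract describes: the global policy at stage k explains its behavior through the instructions it issues to lower-level, previously trained policies, rather than acting as an opaque flat controller.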
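The hyperparameters in the Experiment Setup row map directly onto optimizer and exploration settings. Below is a minimal PyTorch-style sketch, assuming a model with a hypothetical compute_loss method and a linear ε-annealing schedule; only the numeric values come from the paper.

```python
import torch

# Numeric values are quoted from the paper; everything else is illustrative.
LEARNING_RATE = 1e-4           # RMSProp learning rate
BATCH_SIZE = 36                # batch size
GAMMA = 0.95                   # discount factor ("discounted coefficient")
GRAD_CLIP_NORM = 1.0           # "clip the gradient to a unit norm"
EPS_START, EPS_END = 0.1, 0.0  # epsilon-greedy range for the top-level policy

def epsilon_at(step, total_steps):
    """Anneal epsilon from 0.1 to 0 (the linear schedule shape is assumed)."""
    frac = min(step / float(total_steps), 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def training_step(model, batch, optimizer):
    """One gradient update with unit-norm gradient clipping."""
    loss = model.compute_loss(batch, gamma=GAMMA)  # hypothetical loss method
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP_NORM)
    optimizer.step()
    return loss.item()

# Example wiring (model definition omitted):
# optimizer = torch.optim.RMSprop(model.parameters(), lr=LEARNING_RATE)
```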