GROOT: Learning to Follow Instructions by Watching Gameplay Videos
Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Xiaojian Ma, Anji Liu, Yitao Liang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GROOT against open-world counterparts and human players on a proposed Minecraft Skill Forge benchmark. The Elo ratings clearly show that GROOT is closing the human-machine gap as well as exhibiting a 70% winning rate over the best generalist agent baseline. Qualitative analysis of the induced goal space further demonstrates some interesting emergent properties, including the goal composition and complex gameplay behavior synthesis. |
| Researcher Affiliation | Academia | Shaofei Cai (1,2), Bowei Zhang (3), Zihao Wang (1,2), Xiaojian Ma (5), Anji Liu (4), Yitao Liang (1); Team CraftJarvis. Affiliations: (1) Institute for Artificial Intelligence, Peking University; (2) School of Intelligence Science and Technology, Peking University; (3) School of Electronics Engineering and Computer Science, Peking University; (4) Computer Science Department, University of California, Los Angeles; (5) Beijing Institute for General Artificial Intelligence (BIGAI) |
| Pseudocode | No | The paper includes mathematical formulations, architectural diagrams (Figure 2), and detailed descriptions of components, but it does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper provides a project website URL (https://craftjarvis.github.io/GROOT) on the first page, but it does not contain an explicit statement by the authors confirming the release of their source code for the described methodology or a direct link to a code repository within the text. |
| Open Datasets | Yes | The contractor data is a Minecraft offline trajectory dataset provided by Baker et al. (2022), which is annotated by professional human players and used for training the inverse dynamics model (https://github.com/openai/Video-Pre-Training). |
| Dataset Splits | No | The paper states it uses the 'contractor data' for training but does not specify explicit training, validation, or test dataset splits in terms of percentages or absolute counts for its own model (GROOT). The evaluation is done on a new benchmark, not on a conventional validation split. |
| Hardware Specification | Yes | GPU types: NVIDIA RTX 4090Ti and A40; parallel GPUs: 8. |
| Software Dependencies | No | The paper names software components such as EfficientNet-B0 (CNN backbone), minGPT without a causal mask (encoder transformer), and Transformer-XL (decoder transformer), but does not provide specific version numbers for these libraries or frameworks. |
| Experiment Setup | Yes | Table 2 lists the hyperparameters for training GROOT: optimizer AdamW; weight decay 0.001; learning rate 0.0000181 (1.81e-5); warmup steps 2000; batch size 2 per GPU (128 total); training precision bf16; trajectory chunk size 128; attention memory size 256; KL loss weight 0.01. |
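The reported hyperparameters can be collected into a single training configuration for reference. The sketch below is illustrative only: the dict keys, the `linear_warmup_lr` helper, and the linear warmup shape are assumptions, since the paper reports the values (Table 2) but not the authors' code or the exact schedule shape.

```python
# Hyperparameters reported in Table 2 of the GROOT paper, gathered into one
# config dict. Key names are hypothetical; only the values come from the paper.
GROOT_TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "weight_decay": 0.001,
    "learning_rate": 0.0000181,    # 1.81e-5 peak learning rate
    "warmup_steps": 2000,
    "batch_size_per_gpu": 2,
    "total_batch_size": 128,       # larger than 2 x 8 GPUs, which suggests
                                   # gradient accumulation (assumption)
    "precision": "bf16",
    "trajectory_chunk_size": 128,
    "attention_memory_size": 256,  # Transformer-XL memory length
    "kl_loss_weight": 0.01,
}

def linear_warmup_lr(step: int, cfg: dict = GROOT_TRAIN_CONFIG) -> float:
    """Ramp linearly from 0 to the peak learning rate over the warmup steps,
    then hold. The paper states only the warmup step count; the linear shape
    is a common default, not confirmed by the authors."""
    peak = cfg["learning_rate"]
    return peak * min(1.0, step / cfg["warmup_steps"])
```

Expressed this way, the schedule reaches the peak rate of 1.81e-5 exactly at step 2000 and stays flat afterwards.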