OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian (Shawn) Ma, Yitao Liang

NeurIPS 2024

Reproducibility assessment: each entry below lists the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Unlike prior works that either emit textual goals to separate controllers or produce control commands directly, OmniJARVIS takes a different path, seeking both strong reasoning and efficient decision-making via unified tokenization of multimodal interaction data (a schematic code sketch of this idea follows the table). ... OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles of interaction data formation and unified tokenization, and its scaling potential. The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS/.
Researcher Affiliation: Collaboration. Zihao Wang¹, Shaofei Cai¹, Zhancun Mu², Haowei Lin¹, Ceyao Zhang³, Xuejie Liu¹, Qing Li³, Anji Liu⁴, Xiaojian Ma³, Yitao Liang¹ (Team CraftJarvis). ¹Institute for Artificial Intelligence, Peking University; ²Yuanpei College, Peking University; ³Beijing Institute for General Artificial Intelligence (BIGAI); ⁴University of California, Los Angeles.
Pseudocode: No. No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code: No. The paper states only that the dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS/.
Open Datasets: Yes. The training data for the Behavior Tokenizer comes from the Contractor Dataset [2], a collection of Minecraft gameplay videos.
Dataset Splits: No. The paper mentions evaluating on a "validation set" (Figure 5) and "validation datasets" (Table 5), but it never quantifies these splits (e.g., percentages or exact sample counts relative to the full dataset).
Hardware Specification: Yes. Training took place on 8 A800 GPUs with FSDP... training was conducted on a cluster of eight NVIDIA 3090 Ti graphics cards.
Software Dependencies: No. The paper mentions the "SFTTrainer class from the TRL library by Hugging Face" and FSDP, but it specifies no version numbers for these or any other key libraries.
Experiment Setup: Yes. Two training configurations are reported. First: the learning rate was set at 1.4e-5 with a cosine learning rate scheduler; the weight decay parameter was set to 0 with a warm-up ratio of 0.03; ... with a batch size of 2 and gradient accumulation steps of 4 using bf16 precision. Second: the learning rate was set at 0.00004 (4e-5), with a weight decay of 0.001 and a batch size of 2. Hedged code sketches of both configurations follow this table.
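
To make the abstract's "unified tokenization" summary concrete, below is a minimal, hypothetical sketch in which discrete behavior codes are appended to an ordinary text tokenizer's vocabulary, so that instructions and actions share one autoregressive token sequence. The backbone ("gpt2"), the codebook size (512), and the token names are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of unified vision-language-action tokenization.
# Assumptions (not from the paper): "gpt2" backbone, a 512-code behavior
# codebook, and the <beh_i> token naming scheme.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder text backbone

# Pretend a behavior tokenizer quantizes action trajectories into 512 codes,
# each exposed to the language model as an extra special token.
behavior_vocab = [f"<beh_{i}>" for i in range(512)]
tokenizer.add_special_tokens({"additional_special_tokens": behavior_vocab})

# One interaction chunk flattens into a single sequence: a natural-language
# instruction followed by the behavior codes that execute it.
sequence = "chop down a tree" + "".join(f"<beh_{i}>" for i in (17, 254, 3))
input_ids = tokenizer(sequence)["input_ids"]
print(tokenizer.convert_ids_to_tokens(input_ids))
```

In a full VLA setup the language model's embedding matrix would also be grown to cover the new tokens, e.g. with model.resize_token_embeddings(len(tokenizer)).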
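
The paper names TRL's SFTTrainer but pins no library versions, so the following is a sketch rather than the authors' script: the first reported configuration mapped onto TRL's SFTConfig. The output directory, base-model identifier, toy dataset, and FSDP sharding policy are assumptions.

```python
# Hypothetical reconstruction of the first reported configuration using
# Hugging Face TRL. Only the hyperparameter values come from the paper;
# the model id, dataset, output path, and FSDP policy are placeholders.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy stand-in for the tokenized multimodal interaction data.
train_dataset = Dataset.from_dict({"text": ["instruction ... <beh_17><beh_254> ..."] * 8})

args = SFTConfig(
    output_dir="omnijarvis-sft",        # placeholder path
    dataset_text_field="text",
    learning_rate=1.4e-5,               # reported learning rate
    lr_scheduler_type="cosine",         # reported cosine scheduler
    weight_decay=0.0,                   # reported weight decay
    warmup_ratio=0.03,                  # reported warm-up ratio
    per_device_train_batch_size=2,      # reported batch size
    gradient_accumulation_steps=4,      # reported accumulation steps
    bf16=True,                          # reported precision
    fsdp="full_shard auto_wrap",        # FSDP is reported; this exact policy is a guess
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",   # placeholder; the paper's backbone is not asserted here
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```

On the reported eight-GPU A800 node this would be launched with something like `torchrun --nproc_per_node=8 train_sft.py` (launcher and script name assumed).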
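
The second configuration is reported only as three numbers, so this sketch is even more schematic: a generic PyTorch loop with AdamW standing in for the unspecified optimizer, a toy module in place of the Behavior Tokenizer, and a placeholder objective.

```python
# Hypothetical sketch of the second reported configuration (lr 4e-5,
# weight decay 0.001, batch size 2). Only those three values come from
# the paper; the optimizer, module, data, and loss are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(512, 512)  # toy stand-in for the Behavior Tokenizer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, weight_decay=0.001  # reported values
)

loader = DataLoader(
    TensorDataset(torch.randn(64, 512), torch.randn(64, 512)),  # toy data
    batch_size=2,   # reported batch size
    shuffle=True,
)

for inputs, targets in loader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)  # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```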