OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Authors: Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian (Shawn) Ma, Yitao Liang

NeurIPS 2024

Reproducibility assessment: each entry below lists the variable, the assessed result, and the supporting LLM response.
Research Type: Experimental. This paper presents OmniJARVIS, a novel Vision-Language-Action (VLA) model for open-world instruction-following agents in Minecraft. Unlike prior works that either emit textual goals to separate controllers or produce control commands directly, OmniJARVIS takes a different path, seeking both strong reasoning and efficient decision-making via unified tokenization of multimodal interaction data (a schematic code sketch of this idea follows the table). ... OmniJARVIS demonstrates excellent performance on a comprehensive collection of atomic, programmatic, and open-ended tasks in open-world Minecraft. Our analysis further unveils the crucial design principles of interaction data formation and unified tokenization, and its scaling potential. The dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS/.
Researcher Affiliation: Collaboration. Zihao Wang¹, Shaofei Cai¹, Zhancun Mu², Haowei Lin¹, Ceyao Zhang³, Xuejie Liu¹, Qing Li³, Anji Liu⁴, Xiaojian Ma³, Yitao Liang¹ (Team CraftJarvis). ¹Institute for Artificial Intelligence, Peking University; ²Yuanpei College, Peking University; ³Beijing Institute for General Artificial Intelligence (BIGAI); ⁴University of California, Los Angeles.
Pseudocode: No. No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code: No. The paper states only that the dataset, models, and code will be released at https://craftjarvis.org/OmniJARVIS/.
Open Datasets: Yes. The training data for the Behavior Tokenizer comes from the Contractor Dataset [2], a collection of Minecraft gameplay videos.
Dataset Splits: No. The paper mentions evaluating on a "validation set" (Figure 5) and "validation datasets" (Table 5), but it never quantifies these splits (e.g., percentages or exact sample counts relative to the full dataset).
Hardware Specification: Yes. Training took place on 8 A800 GPUs with FSDP... training was conducted on a cluster of eight NVIDIA 3090 Ti graphics cards.
Software Dependencies: No. The paper mentions the "SFTTrainer class from the TRL library by Hugging Face" and FSDP, but it specifies no version numbers for these or any other key libraries.
Experiment Setup: Yes. Two training configurations are reported. First: the learning rate was set at 1.4e-5 with a cosine learning rate scheduler; the weight decay parameter was set to 0 with a warm-up ratio of 0.03; ... with a batch size of 2 and gradient accumulation steps of 4 using bf16 precision. Second: the learning rate was set at 0.00004 (4e-5), with a weight decay of 0.001 and a batch size of 2. Hedged code sketches of both configurations follow this table.
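
To make the abstract's "unified tokenization" summary concrete, below is a minimal, hypothetical sketch in which discrete behavior codes are appended to an ordinary text tokenizer's vocabulary, so that instructions and actions share one autoregressive token sequence. The backbone ("gpt2"), the codebook size (512), and the token names are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of unified vision-language-action tokenization.
# Assumptions (not from the paper): "gpt2" backbone, a 512-code behavior
# codebook, and the <beh_i> token naming scheme.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder text backbone

# Pretend a behavior tokenizer quantizes action trajectories into 512 codes,
# each exposed to the language model as an extra special token.
behavior_vocab = [f"<beh_{i}>" for i in range(512)]
tokenizer.add_special_tokens({"additional_special_tokens": behavior_vocab})

# One interaction chunk flattens into a single sequence: a natural-language
# instruction followed by the behavior codes that execute it.
sequence = "chop down a tree" + "".join(f"<beh_{i}>" for i in (17, 254, 3))
input_ids = tokenizer(sequence)["input_ids"]
print(tokenizer.convert_ids_to_tokens(input_ids))
```

In a full VLA setup the language model's embedding matrix would also be grown to cover the new tokens, e.g. with model.resize_token_embeddings(len(tokenizer)).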
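
The paper names TRL's SFTTrainer but pins no library versions, so the following is a sketch rather than the authors' script: the first reported configuration mapped onto TRL's SFTConfig. The output directory, base-model identifier, toy dataset, and FSDP sharding policy are assumptions.

```python
# Hypothetical reconstruction of the first reported configuration using
# Hugging Face TRL. Only the hyperparameter values come from the paper;
# the model id, dataset, output path, and FSDP policy are placeholders.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Toy stand-in for the tokenized multimodal interaction data.
train_dataset = Dataset.from_dict({"text": ["instruction ... <beh_17><beh_254> ..."] * 8})

args = SFTConfig(
    output_dir="omnijarvis-sft",        # placeholder path
    dataset_text_field="text",
    learning_rate=1.4e-5,               # reported learning rate
    lr_scheduler_type="cosine",         # reported cosine scheduler
    weight_decay=0.0,                   # reported weight decay
    warmup_ratio=0.03,                  # reported warm-up ratio
    per_device_train_batch_size=2,      # reported batch size
    gradient_accumulation_steps=4,      # reported accumulation steps
    bf16=True,                          # reported precision
    fsdp="full_shard auto_wrap",        # FSDP is reported; this exact policy is a guess
)

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",   # placeholder; the paper's backbone is not asserted here
    args=args,
    train_dataset=train_dataset,
)
trainer.train()
```

On the reported eight-GPU A800 node this would be launched with something like `torchrun --nproc_per_node=8 train_sft.py` (launcher and script name assumed).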
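
The second configuration is reported only as three numbers, so this sketch is even more schematic: a generic PyTorch loop with AdamW standing in for the unspecified optimizer, a toy module in place of the Behavior Tokenizer, and a placeholder objective.

```python
# Hypothetical sketch of the second reported configuration (lr 4e-5,
# weight decay 0.001, batch size 2). Only those three values come from
# the paper; the optimizer, module, data, and loss are assumptions.
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(512, 512)  # toy stand-in for the Behavior Tokenizer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=4e-5, weight_decay=0.001  # reported values
)

loader = DataLoader(
    TensorDataset(torch.randn(64, 512), torch.randn(64, 512)),  # toy data
    batch_size=2,   # reported batch size
    shuffle=True,
)

for inputs, targets in loader:
    loss = torch.nn.functional.mse_loss(model(inputs), targets)  # placeholder objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```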