Grounding Multimodal Large Language Models in Actions
Authors: Andrew Szot, Bogdan Mazoure, Harsh Agrawal, R Devon Hjelm, Zsolt Kira, Alexander Toshev
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks. |
| Researcher Affiliation | Collaboration | ¹Apple, ²Georgia Tech, ³Mila |
| Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled "Pseudocode" or "Algorithm" with structured steps formatted like code. |
| Open Source Code | No | We also plan to release our code. |
| Open Datasets | Yes | CALVIN [30]: This manipulation benchmark... We use the ABC→D split of the benchmark with 34 tasks... Meta-World [58]: We use the ML-45 version of this tabletop manipulation benchmark which has 45 tasks... Habitat Pick (HabPick) [59]: A mobile manipulation robot... BabyAI [60]: BabyAI is a grid world task... Language Rearrangement (LangR) [21]: A mobile manipulation robot... |
| Dataset Splits | Yes | We use the ABC→D split of the benchmark... We hold out 1024 subsequences of the policy context length from these trajectories for reporting validation performance during the SFT process. We evaluate on the test episodes from Szot et al. [72] which are 1,000 episodes in unseen home layouts. We report performance on the unseen synonyms generalization test, described in Section 4.2 of Carta et al. [40]. |
| Hardware Specification | Yes | We train the CALVIN, Meta-World and Hab Pick imitation learning results on a 4x A40 GPU setup. We train the Language Rearrangement and Baby AI experiments on a 8x A100-80GB GPU setup. |
| Software Dependencies | No | The paper mentions using "Hugging Face Transformers library [73], PyTorch [74], DeepSpeed [75]" but does not specify exact version numbers for these software components. |
| Experiment Setup | Yes | We use LLaVA-1.5-7B [6] as the base MLLM. We finetune the MLLM for interactive tasks using a dataset of expert demonstrations... We use the AdamW optimizer [61] with a learning rate of 3e-4, a warmup period of 10% of the total number of training steps, and cosine learning rate decay to 0 by the end of training. For RL, we use PPO [62]. For the learned tokenization action space adapters, we, by default, use a codebook size of 512 with 512 dimensions per codebook element. For LoRA we use rank value 128, alpha parameter 32 and dropout 0.1. (Minimal configuration sketches of these settings follow the table.) |
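For context on the Experiment Setup row, the following is a minimal sketch of the quoted fine-tuning hyperparameters (AdamW at learning rate 3e-4, 10% warmup with cosine decay to 0, LoRA rank 128 / alpha 32 / dropout 0.1) using the Hugging Face Transformers and PEFT libraries the paper cites. The model id, LoRA target modules, and total step count are assumptions, not values confirmed by the paper; per the Open Source Code row, the authors' own training code was not available at review time.

```python
# Hedged sketch of the reported fine-tuning settings; not the authors' code.
import torch
from transformers import LlavaForConditionalGeneration, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # assumed Hub id for LLaVA-1.5-7B
    torch_dtype=torch.bfloat16,
)

# LoRA settings quoted in the paper: rank 128, alpha 32, dropout 0.1.
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed; target modules are not listed in the quote
)
model = get_peft_model(model, lora_cfg)

total_steps = 10_000  # placeholder; the total number of training steps is not quoted
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup for 10% of training steps
    num_training_steps=total_steps,           # cosine decay to 0 by the end of training
)
```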
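The same row quotes a learned-tokenization action space adapter with a codebook of 512 entries of 512 dimensions each. The sketch below is a generic vector-quantization tokenizer that illustrates that idea: continuous actions are encoded, snapped to the nearest codebook entry, and the resulting indices act as discrete action tokens. The encoder/decoder layers, the 7-dimensional action space, and the straight-through estimator are illustrative assumptions, not the paper's exact adapter architecture.

```python
# Illustrative vector-quantization tokenizer for continuous actions; not the authors' adapter.
import torch
import torch.nn as nn

class ActionVQTokenizer(nn.Module):
    def __init__(self, action_dim: int, codebook_size: int = 512, code_dim: int = 512):
        super().__init__()
        # Small MLP encoder/decoder (assumed); codebook matches the reported 512 x 512 default.
        self.encoder = nn.Sequential(nn.Linear(action_dim, code_dim), nn.ReLU(),
                                     nn.Linear(code_dim, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, code_dim), nn.ReLU(),
                                     nn.Linear(code_dim, action_dim))
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, actions: torch.Tensor):
        z = self.encoder(actions)                     # (B, code_dim)
        dists = torch.cdist(z, self.codebook.weight)  # distances to all codebook entries
        tokens = dists.argmin(dim=-1)                 # discrete action tokens, shape (B,)
        z_q = self.codebook(tokens)
        z_q = z + (z_q - z).detach()                  # straight-through gradient estimator
        recon = self.decoder(z_q)                     # reconstructed continuous action
        return tokens, recon

# Usage: tokenize a batch of 7-DoF manipulation actions (dimension is an assumption).
tok = ActionVQTokenizer(action_dim=7)
tokens, recon = tok(torch.randn(4, 7))
```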