Grounding Multimodal Large Language Models in Actions
Authors: Andrew Szot, Bogdan Mazoure, Harsh Agrawal, R Devon Hjelm, Zsolt Kira, Alexander Toshev
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study how to best ground an MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks. |
| Researcher Affiliation | Collaboration | ¹Apple, ²Georgia Tech, ³Mila |
| Pseudocode | No | The paper does not include a figure, block, or section explicitly labeled "Pseudocode" or "Algorithm" with structured steps formatted like code. |
| Open Source Code | No | We also plan to release our code. |
| Open Datasets | Yes | CALVIN [30]: This manipulation benchmark... We use the ABC→D split of the benchmark with 34 tasks... Meta-World [58]: We use the ML-45 version of this tabletop manipulation benchmark which has 45 tasks... Habitat Pick (HabPick) [59]: A mobile manipulation robot... BabyAI [60]: BabyAI is a grid world task... Language Rearrangement (LangR) [21]: A mobile manipulation robot... |
| Dataset Splits | Yes | We use the ABC→D split of the benchmark... We hold out 1024 subsequences of the policy context length from these trajectories for reporting validation performance during the SFT process. We evaluate on the test episodes from Szot et al. [72] which are 1,000 episodes in unseen home layouts. We report performance on the unseen synonyms generalization test, described in Section 4.2 of Carta et al. [40]. |
| Hardware Specification | Yes | We train the CALVIN, Meta-World and Hab Pick imitation learning results on a 4x A40 GPU setup. We train the Language Rearrangement and Baby AI experiments on a 8x A100-80GB GPU setup. |
| Software Dependencies | No | The paper mentions using "Hugging Face Transformers library [73], PyTorch [74], DeepSpeed [75]" but does not specify exact version numbers for these software components. |
| Experiment Setup | Yes | We use LLaVA-1.5-7B [6] as the base MLLM. We finetune the MLLM for interactive tasks using a dataset of expert demonstrations... We use the AdamW optimizer [61] with a learning rate of 3e-4, a warmup period of 10% of the total number of training steps, and cosine learning rate decay to 0 by the end of training. For RL, we use PPO [62]. For the learned tokenization action space adapters, we, by default, use a codebook size of 512 with 512 dimensions per codebook element. For LoRA we use rank value 128, alpha parameter 32 and dropout 0.1. (Minimal configuration sketches of these settings follow the table.) |
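For context on the Experiment Setup row, the following is a minimal sketch of the quoted fine-tuning hyperparameters (AdamW at learning rate 3e-4, 10% warmup with cosine decay to 0, LoRA rank 128 / alpha 32 / dropout 0.1) using the Hugging Face Transformers and PEFT libraries the paper cites. The model id, LoRA target modules, and total step count are assumptions, not values confirmed by the paper; per the Open Source Code row, the authors' own training code was not available at review time.

```python
# Hedged sketch of the reported fine-tuning settings; not the authors' code.
import torch
from transformers import LlavaForConditionalGeneration, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",  # assumed Hub id for LLaVA-1.5-7B
    torch_dtype=torch.bfloat16,
)

# LoRA settings quoted in the paper: rank 128, alpha 32, dropout 0.1.
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed; target modules are not listed in the quote
)
model = get_peft_model(model, lora_cfg)

total_steps = 10_000  # placeholder; the total number of training steps is not quoted
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # warmup for 10% of training steps
    num_training_steps=total_steps,           # cosine decay to 0 by the end of training
)
```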
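The same row quotes a learned-tokenization action space adapter with a codebook of 512 entries of 512 dimensions each. The sketch below is a generic vector-quantization tokenizer that illustrates that idea: continuous actions are encoded, snapped to the nearest codebook entry, and the resulting indices act as discrete action tokens. The encoder/decoder layers, the 7-dimensional action space, and the straight-through estimator are illustrative assumptions, not the paper's exact adapter architecture.

```python
# Illustrative vector-quantization tokenizer for continuous actions; not the authors' adapter.
import torch
import torch.nn as nn

class ActionVQTokenizer(nn.Module):
    def __init__(self, action_dim: int, codebook_size: int = 512, code_dim: int = 512):
        super().__init__()
        # Small MLP encoder/decoder (assumed); codebook matches the reported 512 x 512 default.
        self.encoder = nn.Sequential(nn.Linear(action_dim, code_dim), nn.ReLU(),
                                     nn.Linear(code_dim, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, code_dim), nn.ReLU(),
                                     nn.Linear(code_dim, action_dim))
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, actions: torch.Tensor):
        z = self.encoder(actions)                     # (B, code_dim)
        dists = torch.cdist(z, self.codebook.weight)  # distances to all codebook entries
        tokens = dists.argmin(dim=-1)                 # discrete action tokens, shape (B,)
        z_q = self.codebook(tokens)
        z_q = z + (z_q - z).detach()                  # straight-through gradient estimator
        recon = self.decoder(z_q)                     # reconstructed continuous action
        return tokens, recon

# Usage: tokenize a batch of 7-DoF manipulation actions (dimension is an assumption).
tok = ActionVQTokenizer(action_dim=7)
tokens, recon = tok(torch.randn(4, 7))
```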