PaLM-E: An Embodied Multimodal Language Model

Authors: Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
Researcher Affiliation | Collaboration | ¹Robotics at Google, ²TU Berlin, ³Google Research. Correspondence to: Danny Driess <danny.driess@gmail.com>, Pete Florence <peteflorence@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper refers to 'https://palm-e.github.io for videos showing the capabilities of PaLM-E on those tasks.' This link points to a project page with videos and information, but it is not an explicit statement of, or a direct link to, the source code for the methodology described in the paper.
Open Datasets | Yes | Tab. 6 shows the dataset and sampling frequency for the full mixture as referred to in the experiments. The majority of the data distribution is general vision-language tasks, with less than 10% robot data. Datasets (sampling frequency, share of mixture): WebLI (Chen et al., 2022): 100, 52.4%; VQ2A (Changpinyo et al., 2022): 25, 13.1%; VQG (Changpinyo et al., 2022): 10, 5.2%; CC3M (Sharma et al., 2018): 25, 13.1%; Object Aware (Piergiovanni et al., 2022): 10, 5.2%; OK-VQA (Marino et al., 2019): 1, 0.5%; VQAv2 (Goyal et al., 2017): 1, 0.5%; COCO (Chen et al., 2015): 1, 0.5%; Wikipedia text: 1, 0.5%; (robot) Mobile Manipulator, real: 6, 3.1%; (robot) Language Table (Lynch et al., 2022), sim and real: 8, 4.2%; (robot) TAMP, sim: 3, 1.6%. (See the frequency-normalization sketch after this table.)
Dataset Splits | Yes | Table 5: Results on general visual-language tasks. For the generalist models, they are the same checkpoint across the different evaluations, while task-specific finetuned models use different finetuned models for the different tasks. COCO uses Karpathy splits. (The quoted Table 5 header lists the columns Model, test-dev, test-std, OK-VQA val, and Karpathy test.)
Hardware Specification | No | The paper mentions scaling up to '562B parameters' and integrating specific large models like '540B PaLM' and '22B Vision Transformer', which implies the use of significant computational resources. However, it does not provide specific details about the hardware used to run its experiments, such as particular GPU models (e.g., NVIDIA A100), CPU types, or TPU versions.
Software Dependencies | No | The paper mentions using pre-trained models such as PaLM (Chowdhery et al., 2022) and Vision Transformer (ViT) (Dehghani et al., 2023), but it does not provide specific version numbers for any software dependencies, libraries (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup | No | The paper describes general training strategies, such as training encoders 'end-to-end' and investigating 'freezing vs. finetuning the language model while training the encoders'. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings in the main text. (A minimal illustrative sketch of the freeze-vs-finetune choice follows this table.)
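The Experiment Setup row above mentions training encoders end-to-end and comparing freezing vs. finetuning the language model. Below is a minimal, hypothetical PyTorch sketch of that choice; PaLM-E's actual training code and framework are not public, and the module names here are assumptions, not the paper's implementation.

```python
# Illustrative sketch only (PaLM-E's training code is not released):
# the freeze-vs-finetune choice for the language model while the observation
# encoder is trained end-to-end. `language_model` and `vision_encoder` are
# hypothetical torch.nn.Module instances.
import torch


def trainable_parameters(language_model: torch.nn.Module,
                         vision_encoder: torch.nn.Module,
                         freeze_lm: bool = True):
    """Return the parameters to optimize under the chosen strategy."""
    if freeze_lm:
        # Keep the LLM weights fixed; only the encoder that maps observations
        # into the LLM's embedding space receives gradients.
        for p in language_model.parameters():
            p.requires_grad = False
        return list(vision_encoder.parameters())
    # Otherwise finetune the language model jointly with the encoder.
    return list(language_model.parameters()) + list(vision_encoder.parameters())


# Hypothetical usage; the paper does not report optimizer or learning rate:
# params = trainable_parameters(language_model, vision_encoder, freeze_lm=True)
# optimizer = torch.optim.Adam(params, lr=1e-4)
```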
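The percentages in the Open Datasets row appear to be the quoted sampling frequencies normalized by their sum. A minimal sketch of that arithmetic, assuming simple frequency-proportional sampling (the dataset names and weights come from the quoted Tab. 6; everything else is illustrative):

```python
# Minimal sketch (not PaLM-E's released data pipeline): normalize the Tab. 6
# sampling frequencies into mixture percentages, assuming frequency-proportional
# sampling over the listed datasets.
sampling_frequency = {
    "WebLI": 100,
    "VQ2A": 25,
    "VQG": 10,
    "CC3M": 25,
    "Object Aware": 10,
    "OK-VQA": 1,
    "VQAv2": 1,
    "COCO": 1,
    "Wikipedia text": 1,
    "(robot) Mobile Manipulator, real": 6,
    "(robot) Language Table, sim and real": 8,
    "(robot) TAMP, sim": 3,
}

total = sum(sampling_frequency.values())  # 191
for name, freq in sampling_frequency.items():
    # Reproduces the quoted shares, e.g. WebLI 100/191 = 52.4%.
    print(f"{name:40s} {freq:4d}  {100 * freq / total:4.1f}%")

robot_total = sum(f for n, f in sampling_frequency.items() if n.startswith("(robot)"))
print(f"robot data share: {100 * robot_total / total:.1f}%")  # ~8.9%, i.e. under 10%
```

Running this reproduces the quoted mixture shares and confirms the "less than 10% robot data" statement (17/191 of the sampling weight is robot data).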