PaLM-E: An Embodied Multimodal Language Model

Authors: Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
Researcher Affiliation | Collaboration | ¹Robotics at Google, ²TU Berlin, ³Google Research. Correspondence to: Danny Driess <danny.driess@gmail.com>, Pete Florence <peteflorence@google.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper refers to 'https://palm-e.github.io for videos showing the capabilities of PaLM-E on those tasks.' This link points to a project page with videos and information, but it is not an explicit statement of, or a direct link to, the source code for the methodology described in the paper.
Open Datasets | Yes | Tab. 6 shows the dataset and sampling frequency for the full mixture as referred to in the experiments. The majority of the data distribution is general vision-language tasks, with less than 10% robot data. Datasets (sampling frequency, share of mixture): WebLI (Chen et al., 2022): 100, 52.4%; VQ2A (Changpinyo et al., 2022): 25, 13.1%; VQG (Changpinyo et al., 2022): 10, 5.2%; CC3M (Sharma et al., 2018): 25, 13.1%; Object Aware (Piergiovanni et al., 2022): 10, 5.2%; OK-VQA (Marino et al., 2019): 1, 0.5%; VQAv2 (Goyal et al., 2017): 1, 0.5%; COCO (Chen et al., 2015): 1, 0.5%; Wikipedia text: 1, 0.5%; (robot) Mobile Manipulator, real: 6, 3.1%; (robot) Language Table (Lynch et al., 2022), sim and real: 8, 4.2%; (robot) TAMP, sim: 3, 1.6%. (See the frequency-normalization sketch after this table.)
Dataset Splits | Yes | Table 5: Results on general visual-language tasks. For the generalist models, they are the same checkpoint across the different evaluations, while task-specific finetuned models use different finetuned models for the different tasks. COCO uses Karpathy splits. (The quoted Table 5 header lists the columns Model, test-dev, test-std, OK-VQA val, and Karpathy test.)
Hardware Specification | No | The paper mentions scaling up to '562B parameters' and integrating specific large models like '540B PaLM' and '22B Vision Transformer', which implies the use of significant computational resources. However, it does not provide specific details about the hardware used to run its experiments, such as particular GPU models (e.g., NVIDIA A100), CPU types, or TPU versions.
Software Dependencies | No | The paper mentions using pre-trained models such as PaLM (Chowdhery et al., 2022) and Vision Transformer (ViT) (Dehghani et al., 2023), but it does not provide specific version numbers for any software dependencies, libraries (e.g., PyTorch, TensorFlow), or programming languages used in the experiments.
Experiment Setup | No | The paper describes general training strategies, such as training encoders 'end-to-end' and investigating 'freezing vs. finetuning the language model while training the encoders'. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings in the main text. (A minimal illustrative sketch of the freeze-vs-finetune choice follows this table.)
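The Experiment Setup row above mentions training encoders end-to-end and comparing freezing vs. finetuning the language model. Below is a minimal, hypothetical PyTorch sketch of that choice; PaLM-E's actual training code and framework are not public, and the module names here are assumptions, not the paper's implementation.

```python
# Illustrative sketch only (PaLM-E's training code is not released):
# the freeze-vs-finetune choice for the language model while the observation
# encoder is trained end-to-end. `language_model` and `vision_encoder` are
# hypothetical torch.nn.Module instances.
import torch


def trainable_parameters(language_model: torch.nn.Module,
                         vision_encoder: torch.nn.Module,
                         freeze_lm: bool = True):
    """Return the parameters to optimize under the chosen strategy."""
    if freeze_lm:
        # Keep the LLM weights fixed; only the encoder that maps observations
        # into the LLM's embedding space receives gradients.
        for p in language_model.parameters():
            p.requires_grad = False
        return list(vision_encoder.parameters())
    # Otherwise finetune the language model jointly with the encoder.
    return list(language_model.parameters()) + list(vision_encoder.parameters())


# Hypothetical usage; the paper does not report optimizer or learning rate:
# params = trainable_parameters(language_model, vision_encoder, freeze_lm=True)
# optimizer = torch.optim.Adam(params, lr=1e-4)
```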
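The percentages in the Open Datasets row appear to be the quoted sampling frequencies normalized by their sum. A minimal sketch of that arithmetic, assuming simple frequency-proportional sampling (the dataset names and weights come from the quoted Tab. 6; everything else is illustrative):

```python
# Minimal sketch (not PaLM-E's released data pipeline): normalize the Tab. 6
# sampling frequencies into mixture percentages, assuming frequency-proportional
# sampling over the listed datasets.
sampling_frequency = {
    "WebLI": 100,
    "VQ2A": 25,
    "VQG": 10,
    "CC3M": 25,
    "Object Aware": 10,
    "OK-VQA": 1,
    "VQAv2": 1,
    "COCO": 1,
    "Wikipedia text": 1,
    "(robot) Mobile Manipulator, real": 6,
    "(robot) Language Table, sim and real": 8,
    "(robot) TAMP, sim": 3,
}

total = sum(sampling_frequency.values())  # 191
for name, freq in sampling_frequency.items():
    # Reproduces the quoted shares, e.g. WebLI 100/191 = 52.4%.
    print(f"{name:40s} {freq:4d}  {100 * freq / total:4.1f}%")

robot_total = sum(f for n, f in sampling_frequency.items() if n.startswith("(robot)"))
print(f"robot data share: {100 * robot_total / total:.1f}%")  # ~8.9%, i.e. under 10%
```

Running this reproduces the quoted mixture shares and confirms the "less than 10% robot data" statement (17/191 of the sampling weight is robot data).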