MuEP: A Multimodal Benchmark for Embodied Planning with Foundation Models

Authors: Kanxue Li, Baosheng Yu, Qi Zheng, Yibing Zhan, Yuhui Zhang, Tianle Zhang, Yijun Yang, Yue Chen, Lei Sun, Qiong Cao, Li Shen, Lusong Li, Dapeng Tao, Xiaodong He

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that foundation models based on textual representations of environments usually outperform their visual counterparts, suggesting a gap in embodied planning abilities with multimodal observations.
Researcher Affiliation | Collaboration | Kanxue Li (1,2,6), Baosheng Yu (3), Qi Zheng (4), Yibing Zhan (2), Yuhui Zhang (3,2), Tianle Zhang (2), Yijun Yang (5), Yue Chen (2), Lei Sun (2), Qiong Cao (2), Li Shen (2), Lusong Li (2), Dapeng Tao (1,6) and Xiaodong He (2). Affiliations: 1 Yunnan University; 2 JD Explore Academy; 3 University of Sydney; 4 Shenzhen University; 5 University of Technology Sydney; 6 Yunnan Key Laboratory of Media Convergence.
Pseudocode | No | The paper describes procedures and pipelines in text and diagrams (Figure 2, Figure 5), but it does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and datasets are available: https://github.com/kanxueli/MuEP.
Open Datasets | Yes | The source code and datasets are available: https://github.com/kanxueli/MuEP. We utilized ALFWorld [Shridhar et al., 2021] scenes for testing and its annotated tasks to generate new ones with an LLM (a minimal environment-loading sketch follows the table).
Dataset Splits | No | The paper mentions "134 #Unseen and 140 #Seen test tasks" for evaluation and fine-tuning models on the MuEP dataset, but it does not explicitly provide training, validation, and test splits with percentages or counts, nor does it reference predefined validation splits.
Hardware Specification | Yes | All experiments were accelerated by four Tesla V100 GPUs.
Software Dependencies | No | The paper mentions specific models like "GPT" and fine-tuning methods like QLoRA, but it does not specify version numbers for general software dependencies, programming languages, or libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | Each model is evaluated under a constraint of a maximum of 30 steps per task. For LMMs, core components such as the ViT, Q-Former, and the LLM itself were frozen to maintain stability, with only the projection layer fine-tuned. Notably, the LLaMA-Adapter V2 model only underwent fine-tuning on the biases of its adapter layers. As illustrated on the right side of Figure 5, QLoRA [Dettmers et al., 2023] was employed to fine-tune all LLMs efficiently (a QLoRA fine-tuning sketch follows the table).
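
For context on how the ALFWorld-based test environments are typically driven under the 30-step budget quoted above, here is a minimal evaluation-loop sketch using the public alfworld package. The config path, the evaluation split name, and the propose_action() helper are illustrative assumptions rather than MuEP's released code.

```python
# Minimal sketch: load an ALFWorld text environment and roll out an agent
# under a 30-step budget. API calls follow the public alfworld package;
# the config path and propose_action() are hypothetical placeholders.
import yaml
import alfworld.agents.environment as environment

with open("configs/base_config.yaml") as f:      # hypothetical config path
    config = yaml.safe_load(f)

env_type = config["env"]["type"]                 # e.g. 'AlfredTWEnv' for textual observations
env = getattr(environment, env_type)(config, train_eval="eval_out_of_distribution")
env = env.init_env(batch_size=1)

obs, info = env.reset()
for step in range(30):                           # max 30 steps per task, as stated in the paper
    action = propose_action(obs[0], info)        # hypothetical: query the foundation-model agent
    obs, scores, dones, info = env.step([action])
    if dones[0]:                                 # stop early once the episode terminates
        break
```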
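
The experiment-setup row cites QLoRA [Dettmers et al., 2023] for fine-tuning the LLMs but names no framework or hyperparameters. The sketch below shows a generic QLoRA-style setup with HuggingFace transformers, peft, and bitsandbytes; the base model, LoRA rank, and target modules are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of a QLoRA-style fine-tuning setup: the base LLM is loaded
# in 4-bit NF4 precision and kept frozen, while small LoRA adapters are trained.
# Model name, rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"          # placeholder base LLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                           # 4-bit quantization, as in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)   # freezes base weights, enables gradient checkpointing

lora_config = LoraConfig(
    r=16,                                        # illustrative LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],         # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # only the LoRA adapters receive gradients
model.print_trainable_parameters()
```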