MuEP: A Multimodal Benchmark for Embodied Planning with Foundation Models

Authors: Kanxue Li, Baosheng Yu, Qi Zheng, Yibing Zhan, Yuhui Zhang, Tianle Zhang, Yijun Yang, Yue Chen, Lei Sun, Qiong Cao, Li Shen, Lusong Li, Dapeng Tao, Xiaodong He

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that foundation models based on textual representations of environments usually outperform their visual counterparts, suggesting a gap in embodied planning abilities with multimodal observations.
Researcher Affiliation | Collaboration | Kanxue Li (1,2,6), Baosheng Yu (3), Qi Zheng (4), Yibing Zhan (2), Yuhui Zhang (3,2), Tianle Zhang (2), Yijun Yang (5), Yue Chen (2), Lei Sun (2), Qiong Cao (2), Li Shen (2), Lusong Li (2), Dapeng Tao (1,6) and Xiaodong He (2). Affiliations: 1 Yunnan University; 2 JD Explore Academy; 3 University of Sydney; 4 Shenzhen University; 5 University of Technology Sydney; 6 Yunnan Key Laboratory of Media Convergence.
Pseudocode | No | The paper describes procedures and pipelines in text and diagrams (Figure 2, Figure 5), but it does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | The source code and datasets are available: https://github.com/kanxueli/MuEP.
Open Datasets | Yes | The source code and datasets are available: https://github.com/kanxueli/MuEP. We utilized ALFWorld [Shridhar et al., 2021] scenes for testing and its annotated tasks to generate new ones with an LLM (a minimal environment-loading sketch follows the table).
Dataset Splits | No | The paper mentions "134 #Unseen and 140 #Seen test tasks" for evaluation and fine-tuning models on the MuEP dataset, but it does not explicitly provide training, validation, and test splits with percentages or counts, nor does it reference predefined validation splits.
Hardware Specification | Yes | All experiments were accelerated by four Tesla V100 GPUs.
Software Dependencies | No | The paper mentions specific models like "GPT" and fine-tuning methods like QLoRA, but it does not specify version numbers for general software dependencies, programming languages, or libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | Each model is evaluated under a constraint of a maximum of 30 steps per task. For LMMs, core components such as the ViT, Q-Former, and the LLM itself were frozen to maintain stability, with only the projection layer fine-tuned. Notably, the LLaMA-Adapter V2 model only underwent fine-tuning on the biases of its adapter layers. As illustrated on the right side of Figure 5, QLoRA [Dettmers et al., 2023] was employed to fine-tune all LLMs efficiently (a QLoRA fine-tuning sketch follows the table).
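
For context on how the ALFWorld-based test environments are typically driven under the 30-step budget quoted above, here is a minimal evaluation-loop sketch using the public alfworld package. The config path, the evaluation split name, and the propose_action() helper are illustrative assumptions rather than MuEP's released code.

```python
# Minimal sketch: load an ALFWorld text environment and roll out an agent
# under a 30-step budget. API calls follow the public alfworld package;
# the config path and propose_action() are hypothetical placeholders.
import yaml
import alfworld.agents.environment as environment

with open("configs/base_config.yaml") as f:      # hypothetical config path
    config = yaml.safe_load(f)

env_type = config["env"]["type"]                 # e.g. 'AlfredTWEnv' for textual observations
env = getattr(environment, env_type)(config, train_eval="eval_out_of_distribution")
env = env.init_env(batch_size=1)

obs, info = env.reset()
for step in range(30):                           # max 30 steps per task, as stated in the paper
    action = propose_action(obs[0], info)        # hypothetical: query the foundation-model agent
    obs, scores, dones, info = env.step([action])
    if dones[0]:                                 # stop early once the episode terminates
        break
```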
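
The experiment-setup row cites QLoRA [Dettmers et al., 2023] for fine-tuning the LLMs but names no framework or hyperparameters. The sketch below shows a generic QLoRA-style setup with HuggingFace transformers, peft, and bitsandbytes; the base model, LoRA rank, and target modules are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of a QLoRA-style fine-tuning setup: the base LLM is loaded
# in 4-bit NF4 precision and kept frozen, while small LoRA adapters are trained.
# Model name, rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"          # placeholder base LLM

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                           # 4-bit quantization, as in QLoRA
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)   # freezes base weights, enables gradient checkpointing

lora_config = LoraConfig(
    r=16,                                        # illustrative LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],         # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # only the LoRA adapters receive gradients
model.print_trainable_parameters()
```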