MuEP: A Multimodal Benchmark for Embodied Planning with Foundation Models
Authors: Kanxue Li, Baosheng Yu, Qi Zheng, Yibing Zhan, Yuhui Zhang, Tianle Zhang, Yijun Yang, Yue Chen, Lei Sun, Qiong Cao, Li Shen, Lusong Li, Dapeng Tao, Xiaodong He
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that foundation models based on textual representations of environments usually outperform their visual counterparts, suggesting a gap in embodied planning abilities with multimodal observations. |
| Researcher Affiliation | Collaboration | Kanxue Li1,2,6, Baosheng Yu3, Qi Zheng4, Yibing Zhan2, Yuhui Zhang3,2, Tianle Zhang2, Yijun Yang5, Yue Chen2, Lei Sun2, Qiong Cao2, Li Shen2, Lusong Li2, Dapeng Tao1,6 and Xiaodong He2 1Yunnan University 2JD Explore Academy 3University of Sydney 4Shenzhen University 5University of Technology Sydney 6Yunnan Key Laboratory of Media Convergence |
| Pseudocode | No | The paper describes procedures and pipelines in text and diagrams (Figure 2, Figure 5), but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and datasets are available: https://github.com/kanxueli/MuEP. |
| Open Datasets | Yes | The source code and datasets are available: https://github.com/kanxueli/MuEP. We utilized ALFWorld [Shridhar et al., 2021] scenes for testing and its annotated tasks to generate new ones with an LLM. |
| Dataset Splits | No | The paper mentions "134 #Unseen and 140 #Seen test tasks" for evaluation and fine-tuning models on the MuEP dataset, but it does not explicitly provide training, validation, and test dataset splits with percentages, counts, or references to predefined validation splits. |
| Hardware Specification | Yes | All experiments were accelerated by four Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using specific models like "GPT" and fine-tuning frameworks like "QLoRA", but does not specify version numbers for general software dependencies, programming languages, or libraries such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | Each model is evaluated under a constraint of a maximum of 30 steps per task. For LMMs, core components such as the ViT, Q-Former, and the LLM itself were frozen to maintain stability, with only the projection layer fine-tuned. Notably, the LLaMA-Adapter V2 model only underwent fine-tuning on the biases of its adapter layers. As illustrated on the right side of Figure 5, we employed QLoRA [Dettmers et al., 2023] to fine-tune all LLMs efficiently (see the sketch after this table). |
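
The Experiment Setup row reports that all LLMs were fine-tuned with QLoRA [Dettmers et al., 2023]. The snippet below is a minimal sketch of QLoRA-style fine-tuning using the HuggingFace transformers/peft/bitsandbytes stack; the base model name, target modules, and hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal QLoRA fine-tuning sketch (illustrative; not the paper's exact recipe).
# Assumes the HuggingFace transformers/peft/bitsandbytes stack; the model name,
# target modules, and LoRA hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base LLM

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; the quantized base stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

In this setup only the LoRA adapter parameters receive gradients, which is what keeps memory usage low enough to fine-tune 7B-scale backbones on a small number of GPUs such as the four V100s listed under Hardware Specification.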