EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
Authors: Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It achieves a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset. |
| Researcher Affiliation | Collaboration | 1The University of Hong Kong, 2Shanghai AI Laboratory, 3The Chinese University of Hong Kong, 4Noah's Ark Laboratory, 5Tsinghua University |
| Pseudocode | No | The paper includes examples of prompts (Listing 1 and Listing 2) but does not contain any structured pseudocode or algorithm blocks describing the methodology. |
| Open Source Code | Yes | More demos, code, and dataset information can be found at our homepage. |
| Open Datasets | Yes | For our EgoCOT dataset, we obtain basic data from the Ego4D dataset [16]... The first two stages focus on pre-training in basic cognitive and responsive skills, while the third stage involves training the embodied AI task with egocentric video-text data on EgoCOT. In the first stage, we focus on image-text conversation alignment pre-training, which involves using three datasets: COCO Caption [44], 595 thousand finely filtered image-text pairs from CC3M [45], and 491 thousand filtered image-text pairs obtained by re-captioning LAION-400M using BLIP-2 [17]. (A sketch of this staged data layout is given after the table.) |
| Dataset Splits | No | The paper describes using various datasets for pre-training and few-shot learning with demonstrations, but it does not provide specific train/validation/test dataset splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper mentions using a 'Franka Emika robot arm' for real-world experiments but does not provide specific details about the computing hardware (e.g., GPU models, CPU types, or memory) used for training or inference of the models. |
| Software Dependencies | No | The paper mentions using pre-trained models and datasets (e.g., BLIP-2, ViT-G/14, LLaMA-7B) but does not provide specific version numbers for software dependencies or libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | To further enhance the diversity of generated chains of thought, we employ a temperature parameter of 0.9 and a top-p parameter of 0.95. For each prompt, we perform five sampling iterations. ... We employ Conv3D [47] to adapt the pre-trained vision model from stage 2 for video encoding, using a total of eight evenly distributed keyframes from each video. ... The 3D patches are subsequently encoded into visual tokens via the Conv3D module with a time offset of 2 and are then integrated into the internal vision transformer. (Illustrative sketches of the sampling settings and the keyframe encoding follow the table.) |
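
The Open Datasets row describes a three-stage training curriculum built from public corpora. The sketch below lays that staged data layout out as a plain Python mapping; only the dataset names, pair counts, and stage ordering come from the quoted text, while the stage keys, variable names, and helper function are illustrative assumptions rather than the authors' actual configuration.

```python
# Sketch of the staged pre-training data layout described in the
# "Open Datasets" row. Stage keys, variable names, and the helper are
# illustrative assumptions; dataset names and counts come from the quote.
TRAINING_STAGE_DATA = {
    "stage_1_image_text_alignment": [
        "COCO Caption",                                      # [44]
        "CC3M, 595k finely filtered image-text pairs",       # [45]
        "LAION-400M, 491k pairs re-captioned with BLIP-2",   # [17]
    ],
    "stage_2_cognitive_and_responsive_skills": [
        # image-text instruction data; specifics not quoted in the row above
    ],
    "stage_3_embodied_chain_of_thought": [
        "EgoCOT (egocentric video-text data derived from Ego4D)",  # [16]
    ],
}

def datasets_for_stage(stage: str) -> list[str]:
    """Return the dataset identifiers used in one training stage."""
    return TRAINING_STAGE_DATA[stage]
```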
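For the chain-of-thought generation settings quoted in the Experiment Setup row (temperature 0.9, top-p 0.95, five samples per prompt), here is a minimal sampling sketch. The use of the Hugging Face `transformers` generation API, the placeholder model name, and the token budget are assumptions for illustration; the paper does not state which toolkit or decoder produced the EgoCOT samples.

```python
# Minimal sketch of the diversity-oriented sampling settings quoted above
# (temperature 0.9, top-p 0.95, five samples per prompt). Model name and
# library choice are assumptions, not the authors' pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "placeholder/causal-lm"  # hypothetical; not named in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def sample_chains_of_thought(prompt: str, num_samples: int = 5) -> list[str]:
    """Draw several diverse chain-of-thought completions for one prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.9,        # diversity setting quoted in the table row
        top_p=0.95,             # nucleus-sampling threshold quoted above
        num_return_sequences=num_samples,
        max_new_tokens=256,     # assumed budget; not stated in the paper
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```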
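The same row describes encoding eight evenly distributed keyframes into visual tokens via a Conv3D module with a time offset of 2. Below is a minimal sketch, assuming the "time offset" acts as the temporal kernel and stride of a 3D patch-embedding convolution, with ViT-style 14x14 spatial patches and a 1408-dimensional embedding width; apart from the eight keyframes and the offset of 2, these specifics are assumptions, not confirmed details of the authors' module.

```python
# Sketch of folding 8 keyframes into visual tokens with a 3D convolution.
# Patch size, embedding width, and input resolution are assumed values.
import torch
import torch.nn as nn

class Video3DPatchEmbed(nn.Module):
    def __init__(self, embed_dim: int = 1408, patch_size: int = 14,
                 time_offset: int = 2):
        super().__init__()
        # A temporal kernel/stride of `time_offset` folds pairs of adjacent
        # keyframes into one token step; spatially it acts like a ViT
        # patch-embedding layer.
        self.proj = nn.Conv3d(
            in_channels=3,
            out_channels=embed_dim,
            kernel_size=(time_offset, patch_size, patch_size),
            stride=(time_offset, patch_size, patch_size),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T=8 keyframes, H, W) -> tokens: (B, N, embed_dim)
        x = self.proj(video)                 # (B, C, T', H', W')
        return x.flatten(2).transpose(1, 2)  # flatten the 3D grid into tokens

# Example: 8 evenly sampled keyframes at an assumed 224x224 resolution.
tokens = Video3DPatchEmbed()(torch.randn(1, 3, 8, 224, 224))
print(tokens.shape)  # torch.Size([1, 1024, 1408]) = 4 * 16 * 16 tokens
```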