Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding

Authors: Chuyang Zhao, Yuxin Song, Junru Chen, Kang Rong, Haocheng Feng, Gang Zhang, Shufan Ji, Jingdong Wang, Errui Ding, Yifan Sun

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show our MLLM named Octopus improves accuracy on popular MLLM tasks and is up to 5× faster on visual grounding tasks.
Researcher Affiliation | Collaboration | 1Baidu VIS, 2Beihang University. Equal Contribution.
Pseudocode | No | The paper describes its structure and training process, but it does not include a distinct block or figure specifically labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | Our code is currently being organized. We will release the code and data after the review process is complete.
Open Datasets | Yes | We train Octopus via three stages... Stage-1... using LLaVA pretraining data [18]... Stage-2 pretrains the DETR module on small-scale grounding detection datasets (RefCOCO [41], RefCOCO+ [42], RefCOCOg [42] and Flickr30k [43])... In stage-3, we finetune... on LLaVA-Instruct [18], REC data (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome [44]), and Flickr30k end-to-end.
Dataset Splits | Yes | We evaluate Octopus on the val split of the RefCOCOg benchmark and observe that 64 and 80 object queries achieved the best performance in our setup.
Hardware Specification | Yes | The entire training takes about 2 hours for Stage-1 (1 epoch), 4 hours for Stage-2 (2 epochs) and 120 hours for Stage-3 on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'Vicuna-7B-v1.5' and 'AdamW' for optimization, but it does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | We adopt AdamW as the optimizer and cosine annealing scheduler. The learning rate is initialized to 1e-4 for stage-1 and stage-2, and 2e-5 for stage-3. ... We employ 64 object queries and place the DETR decoder after the 16th LLM layer.
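
Read together, the Open Datasets, Hardware Specification, and Experiment Setup rows describe a three-stage training recipe. The sketch below only collects the numbers quoted in the table; the dictionary layout, key names, and the MODEL_HPARAMS grouping are hypothetical conveniences for readability, not artifacts from the authors' (unreleased) code.

```python
# Illustrative summary of the three Octopus training stages reported in the table.
# All figures are copied from the quoted evidence; all names/keys are hypothetical.
STAGE_CONFIGS = {
    "stage1_alignment": {
        "data": ["LLaVA pretraining data"],
        "epochs": 1,                 # ~2 hours on 8x NVIDIA A100
        "lr": 1e-4,
    },
    "stage2_detr_pretraining": {
        "data": ["RefCOCO", "RefCOCO+", "RefCOCOg", "Flickr30k"],
        "epochs": 2,                 # ~4 hours on 8x NVIDIA A100
        "lr": 1e-4,
    },
    "stage3_instruction_finetuning": {
        "data": ["LLaVA-Instruct", "RefCOCO", "RefCOCO+", "RefCOCOg",
                 "Visual Genome", "Flickr30k"],
        "epochs": None,              # epoch count not stated; ~120 hours on 8x A100
        "lr": 2e-5,
    },
}

# Architectural hyper-parameters quoted in the Experiment Setup row.
MODEL_HPARAMS = {
    "llm_backbone": "Vicuna-7B-v1.5",
    "num_object_queries": 64,        # 64 (and 80) queries performed best on RefCOCOg val
    "detr_decoder_after_llm_layer": 16,
}
```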
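
The optimizer description in the Experiment Setup row (AdamW with a cosine annealing schedule and stage-dependent initial learning rates) maps directly onto standard PyTorch components. The helper below is a minimal sketch under that assumption only: the function name is hypothetical, and warmup, weight decay, betas, and parameter grouping are not reported in the excerpt, so they are left at library defaults.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_optimizer(model: torch.nn.Module, stage: str, total_steps: int):
    """AdamW + cosine annealing with the stage-dependent initial learning
    rates quoted in the table (1e-4 for stage-1/2, 2e-5 for stage-3)."""
    lr = 2e-5 if stage == "stage3" else 1e-4
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```

Calling, e.g., build_optimizer(model, "stage3", total_steps) would reproduce only the learning-rate schedule stated in the table, not the full training pipeline.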