Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding
Authors: Chuyang Zhao, Yuxin Song, Junru Chen, Kang Rong, Haocheng Feng, Gang Zhang, Shufan Ji, Jingdong Wang, Errui Ding, Yifan Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies show our MLLM named Octopus improves accuracy on popular MLLM tasks and is up to 5× faster on visual grounding tasks. |
| Researcher Affiliation | Collaboration | 1 Baidu VIS, 2 Beihang University (equal contribution) |
| Pseudocode | No | The paper describes its structure and training process, but it does not include a distinct block or figure specifically labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Our code is currently being organized. We will release the code and data after the review process is complete. |
| Open Datasets | Yes | We train Octopus via three stages... Stage-1... using LLaVA pretraining data [18]... Stage-2 pretrains the DETR module on small-scale grounding detection datasets (RefCOCO [41], RefCOCO+ [42], RefCOCOg [42] and Flickr30k [43])... In stage-3, we finetune... on LLaVA-Instruct [18], REC data (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome [44]), and Flickr30k end-to-end. (A stage-schedule sketch follows the table.) |
| Dataset Splits | Yes | We evaluate Octopus on the val split of the RefCOCOg benchmark and observe that 64 and 80 object queries achieved the best performance in our setup. |
| Hardware Specification | Yes | The entire training takes about 2 hours for Stage-1 (1 epoch), 4 hours for Stage-2 (2 epochs) and 120 hours for Stage-3 on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Vicuna-7B-v1.5' and 'AdamW' for optimization, but it does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We adopt AdamW as the optimizer and a cosine annealing scheduler. The learning rate is initialized to 1e-4 for stage-1 and stage-2, and 2e-5 for stage-3. ... We employ 64 object queries and place the DETR decoder after the 16th LLM layer. (See the sketches after this table.) |
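
As a concrete reading of the Open Datasets row, here is a minimal sketch of the three-stage training schedule assembled from the quoted excerpts. The dictionary layout and stage names are illustrative assumptions; the dataset lists and the stage-1/stage-2 epoch counts come from the rows above, while the stage-3 epoch count is not stated and is left as a placeholder.

```python
# Hypothetical summary of Octopus's three training stages, assembled from the
# quoted excerpts; stage keys and structure are illustrative, not the paper's.
TRAINING_STAGES = {
    "stage1_alignment": {
        "data": ["LLaVA-pretrain"],
        "epochs": 1,           # per the hardware row: 1 epoch, ~2 hours
    },
    "stage2_detr_pretrain": {
        "data": ["RefCOCO", "RefCOCO+", "RefCOCOg", "Flickr30k"],
        "epochs": 2,           # per the hardware row: 2 epochs, ~4 hours
    },
    "stage3_finetune": {
        "data": ["LLaVA-Instruct", "RefCOCO", "RefCOCO+", "RefCOCOg",
                 "Visual Genome", "Flickr30k"],
        "epochs": None,        # not stated in the excerpts (~120 hours reported)
    },
}
```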
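The Experiment Setup row also pins down two architectural choices: 64 learned object queries and a DETR decoder attached after the 16th LLM layer. The sketch below shows one plausible wiring in PyTorch; the hidden width, head count, decoder depth, and box head are assumptions for illustration, not values from the paper.

```python
import torch
import torch.nn as nn

HIDDEN = 512        # assumed width; not stated in the excerpts
NUM_QUERIES = 64    # from the Experiment Setup row

# Learned object queries decoded in parallel against intermediate LLM states.
object_queries = nn.Embedding(NUM_QUERIES, HIDDEN)
decoder_layer = nn.TransformerDecoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
detr_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)  # depth assumed
box_head = nn.Linear(HIDDEN, 4)  # (cx, cy, w, h) per query, DETR-style

# Stand-in for hidden states taken after the 16th LLM layer.
layer16_states = torch.randn(1, 256, HIDDEN)
queries = object_queries.weight.unsqueeze(0)          # (1, 64, HIDDEN)
decoded = detr_decoder(tgt=queries, memory=layer16_states)
pred_boxes = box_head(decoded).sigmoid()              # (1, 64, 4) normalized boxes
```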
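Finally, the optimization recipe (AdamW with cosine annealing; lr 1e-4 for stages 1-2 and 2e-5 for stage 3) maps directly onto standard PyTorch utilities. This is a minimal sketch under an assumed step count; the paper reports wall-clock time, not iteration counts, and the model here is a placeholder.

```python
import torch

model = torch.nn.Linear(8, 8)            # placeholder for the Octopus model
STAGE = 3
lr = 1e-4 if STAGE in (1, 2) else 2e-5   # from the Experiment Setup row

optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
total_steps = 10_000  # assumed; not reported in the excerpts
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for _ in range(total_steps):
    # ... forward/backward pass on the current stage's data would go here ...
    optimizer.step()
    scheduler.step()
```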