Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding

Authors: Chuyang Zhao, Yuxin Song, Junru Chen, Kang Rong, Haocheng Feng, Gang Zhang, Shufan Ji, Jingdong Wang, Errui Ding, Yifan Sun

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show our MLLM named Octopus improves accuracy on popular MLLM tasks and is up to 5× faster on visual grounding tasks.
Researcher Affiliation | Collaboration | 1Baidu VIS, 2Beihang University. Equal Contribution.
Pseudocode | No | The paper describes its structure and training process, but it does not include a distinct block or figure specifically labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | Our code is currently being organized. We will release the code and data after the review process is complete.
Open Datasets | Yes | We train Octopus via three stages... Stage-1... using LLaVA pretraining data [18]... Stage-2 pretrains the DETR module on small-scale grounding detection datasets (RefCOCO [41], RefCOCO+ [42], RefCOCOg [42] and Flickr30k [43])... In stage-3, we finetune... on LLaVA-Instruct [18], REC data (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome [44]), and Flickr30k end-to-end.
Dataset Splits | Yes | We evaluate Octopus on the val split of the RefCOCOg benchmark and observe that 64 and 80 object queries achieved the best performance in our setup.
Hardware Specification | Yes | The entire training takes about 2 hours for Stage-1 (1 epoch), 4 hours for Stage-2 (2 epochs) and 120 hours for Stage-3 on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software components like 'Vicuna-7B-v1.5' and 'AdamW' for optimization, but it does not provide specific version numbers for these or other key software dependencies.
Experiment Setup | Yes | We adopt AdamW as the optimizer and cosine annealing scheduler. The learning rate is initialized to 1e-4 for stage-1 and stage-2, and 2e-5 for stage-3. ... We employ 64 object queries and place the DETR decoder after the 16th LLM layer.
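
Read together, the Open Datasets, Hardware Specification, and Experiment Setup rows describe a three-stage training recipe. The sketch below only collects the numbers quoted in the table; the dictionary layout, key names, and the MODEL_HPARAMS grouping are hypothetical conveniences for readability, not artifacts from the authors' (unreleased) code.

```python
# Illustrative summary of the three Octopus training stages reported in the table.
# All figures are copied from the quoted evidence; all names/keys are hypothetical.
STAGE_CONFIGS = {
    "stage1_alignment": {
        "data": ["LLaVA pretraining data"],
        "epochs": 1,                 # ~2 hours on 8x NVIDIA A100
        "lr": 1e-4,
    },
    "stage2_detr_pretraining": {
        "data": ["RefCOCO", "RefCOCO+", "RefCOCOg", "Flickr30k"],
        "epochs": 2,                 # ~4 hours on 8x NVIDIA A100
        "lr": 1e-4,
    },
    "stage3_instruction_finetuning": {
        "data": ["LLaVA-Instruct", "RefCOCO", "RefCOCO+", "RefCOCOg",
                 "Visual Genome", "Flickr30k"],
        "epochs": None,              # epoch count not stated; ~120 hours on 8x A100
        "lr": 2e-5,
    },
}

# Architectural hyper-parameters quoted in the Experiment Setup row.
MODEL_HPARAMS = {
    "llm_backbone": "Vicuna-7B-v1.5",
    "num_object_queries": 64,        # 64 (and 80) queries performed best on RefCOCOg val
    "detr_decoder_after_llm_layer": 16,
}
```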
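
The optimizer description in the Experiment Setup row (AdamW with a cosine annealing schedule and stage-dependent initial learning rates) maps directly onto standard PyTorch components. The helper below is a minimal sketch under that assumption only: the function name is hypothetical, and warmup, weight decay, betas, and parameter grouping are not reported in the excerpt, so they are left at library defaults.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_optimizer(model: torch.nn.Module, stage: str, total_steps: int):
    """AdamW + cosine annealing with the stage-dependent initial learning
    rates quoted in the table (1e-4 for stage-1/2, 2e-5 for stage-3)."""
    lr = 2e-5 if stage == "stage3" else 1e-4
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```

Calling, e.g., build_optimizer(model, "stage3", total_steps) would reproduce only the learning-rate schedule stated in the table, not the full training pipeline.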