Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding
Authors: Chuyang Zhao, YuXin Song, Junru Chen, KANG RONG, Haocheng Feng, Gang Zhang, Shufan Ji, Jingdong Wang, Errui Ding, Yifan Sun
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies show our MLLM named Octopus improves accuracy on popular MLLM tasks and is up to 5× faster on visual grounding tasks. |
| Researcher Affiliation | Collaboration | ¹Baidu VIS, ²Beihang University (Equal Contribution) |
| Pseudocode | No | The paper describes its structure and training process, but it does not include a distinct block or figure specifically labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Our code is currently being organized. We will release the code and data after the review process is complete. |
| Open Datasets | Yes | We train Octopus via three stages... Stage-1... using LLaVA pretraining data [18]... Stage-2 pretrains the DETR module on small-scale grounding detection datasets (RefCOCO [41], RefCOCO+ [42], RefCOCOg [42] and Flickr30k [43])... In stage-3, we finetune... on LLaVA-Instruct [18], REC data (RefCOCO, RefCOCO+, RefCOCOg, Visual Genome [44]), and Flickr30k end-to-end. |
| Dataset Splits | Yes | We evaluate Octopus on the val split of RefCOCOg benchmark and observe that 64 and 80 object query achieved the best performance in our setup. |
| Hardware Specification | Yes | The entire training takes about 2 hours for Stage-1 (1 epoch), 4 hours for Stage-2 (2 epochs) and 120 hours for Stage-3 on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Vicuna-7B-v1.5' and 'AdamW' for optimization, but it does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | We adopt AdamW as the optimizer and cosine annealing scheduler. The learning rate is initialized to 1e-4 for stage-1 and stage-2, and 2e-5 for stage-3. ... We employ 64 object queries and place the DETR decoder after the 16-th LLM layer. |
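
The Experiment Setup row quotes a cosine annealing schedule with per-stage initial learning rates. As a rough illustration only, the sketch below computes such a schedule in plain Python. The initial rates come from the row above; the function name `cosine_annealing_lr`, the minimum rate of 0, the absence of warmup, and the step count are assumptions made for illustration and are not stated in the table.

```python
import math

# Initial learning rates as quoted in the Experiment Setup row.
STAGE_LR = {"stage-1": 1e-4, "stage-2": 1e-4, "stage-3": 2e-5}


def cosine_annealing_lr(base_lr, step, total_steps, min_lr=0.0):
    """Return the cosine-annealed learning rate at `step`.

    `min_lr=0.0` and the lack of warmup are assumptions; the table
    does not report either setting.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so out-of-range steps map onto the schedule's endpoints.
    step = min(max(step, 0), total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    total_steps = 1000  # hypothetical; the table gives no step counts
    for stage, base_lr in STAGE_LR.items():
        start = cosine_annealing_lr(base_lr, 0, total_steps)
        middle = cosine_annealing_lr(base_lr, total_steps // 2, total_steps)
        end = cosine_annealing_lr(base_lr, total_steps, total_steps)
        print(stage, start, middle, end)
```

Under these assumptions the rate starts at the quoted initial value, passes through half of it at the midpoint, and decays to the assumed minimum at the final step. Reproducing the actual runs would still require the settings the table does not report.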