OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show the effectiveness of our components and training strategy. In addition to visual segmentation, OMG-LLaVA can also achieve good enough performance on 6 datasets, including COCO panoptic segmentation, VIPSeg video panoptic segmentation, refCOCO, refCOCO+, refCOCOg referring expression segmentation, GranDf grounded conversation generation, and refCOCOg region caption datasets.
Researcher Affiliation | Collaboration | Tao Zhang (1), Xiangtai Li (2,4), Hao Fei (2), Haobo Yuan (3), Shengqiong Wu (2), Shunping Ji (1), Chen Change Loy (3), Shuicheng Yan (2) — 1 Wuhan University; 2 Skywork AI; 3 S-Lab, NTU; 4 Bytedance
Pseudocode | No | The paper describes the model architecture and training process in detail, but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | The code and model have been released for further research.
Open Datasets | Yes | During the pretraining stage, we use the LLaVA pretraining dataset [71] to perform visual-text alignment, following LLaVA. ... For pixel-level understanding and reasoning, we use the referring segmentation datasets, including refCOCO, refCOCO+ [42], refCOCOg [121], and refClef, totaling 74K data. Additionally, semantic segmentation datasets, including ADE20k [137] and COCO-Stuff [7], totaling 26K data, and the grounded conversation generation dataset GranDf [87], containing 200K data, are used.
Dataset Splits | Yes | We evaluate OMG-LLaVA on refCOCO, refCOCO+, and refCOCOg, with the results shown in Tab. 3. ... OMG-LLaVA outperforms LISA [49] by 1.5 cIoU, 3.2 cIoU, and 4.3 cIoU on the validation sets of refCOCO, refCOCO+, and refCOCOg, respectively, while keeping the OMG decoder frozen and using only a single visual encoder.
Hardware Specification | Yes | All training is conducted on four NVIDIA A800 GPUs with 80GB of memory.
Software Dependencies | Yes | We adopt xtuner codebase [22] to build our model and data pipeline.
Experiment Setup | Yes | During the pretraining stage, only the visual projector and text projector are trained, with an initial learning rate set to 1e-3. ... During the instruction tuning stage, the initial learning rate is set to 2e-4, with only the perception model kept frozen, and the LLM is fine-tuned using LoRA [37]. The maximum sequence length in the LLM is set to 2,048.
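To make the "Experiment Setup" row concrete, below is a minimal sketch of the two-stage recipe it quotes, written against Hugging Face Transformers and PEFT rather than the released xtuner configs. The base-model name, projector shapes, and LoRA hyperparameters here are assumptions for illustration; only the learning rates, the freezing scheme, and the 2,048-token limit come from the quoted setup.

```python
# Illustrative sketch (not the authors' released code) of the quoted two-stage recipe.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("internlm/internlm2-chat-7b")  # assumed base LLM

# Stage 1: pretraining -- only the visual and text projectors are trained, lr = 1e-3.
for p in llm.parameters():
    p.requires_grad = False
visual_projector = torch.nn.Linear(1024, llm.config.hidden_size)  # placeholder dimensions
text_projector = torch.nn.Linear(llm.config.hidden_size, 256)     # placeholder dimensions
stage1_optim = torch.optim.AdamW(
    list(visual_projector.parameters()) + list(text_projector.parameters()), lr=1e-3
)

# Stage 2: instruction tuning -- perception model stays frozen, LLM fine-tuned with LoRA, lr = 2e-4.
lora_cfg = LoraConfig(  # rank and target modules are illustrative choices
    r=64, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
llm = get_peft_model(llm, lora_cfg)
stage2_optim = torch.optim.AdamW(
    [p for p in llm.parameters() if p.requires_grad]
    + list(visual_projector.parameters()) + list(text_projector.parameters()),
    lr=2e-4,
)

MAX_SEQ_LEN = 2048  # maximum LLM sequence length reported in the paper
```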
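The referring-segmentation comparison in the "Dataset Splits" row is reported in cIoU. For reference when reproducing those numbers, here is a small sketch of cumulative IoU as it is conventionally computed in this literature: intersections and unions are accumulated over the whole split before taking a single ratio. The function name and toy masks are illustrative, not taken from the OMG-LLaVA repository.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: sum intersections and unions over all samples, then divide once."""
    inter_total, union_total = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter_total += np.logical_and(pred, gt).sum()
        union_total += np.logical_or(pred, gt).sum()
    return inter_total / max(union_total, 1)  # guard against an empty union

# Toy example with two 2x2 masks: (1 + 2) / (2 + 3) = 0.6
preds = [np.array([[1, 0], [0, 1]], bool), np.array([[1, 1], [0, 0]], bool)]
gts   = [np.array([[1, 0], [0, 0]], bool), np.array([[1, 1], [1, 0]], bool)]
print(cumulative_iou(preds, gts))
```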