OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show the effectiveness of our components and training strategy. In addition to visual segmentation, OMG-LLaVA can also achieve good enough performance on 6 datasets, including COCO panoptic segmentation, VIPSeg video panoptic segmentation, refCOCO, refCOCO+, refCOCOg referring expression segmentation, GranDf grounded conversation generation, and refCOCOg region caption datasets. |
| Researcher Affiliation | Collaboration | Tao Zhang (1), Xiangtai Li (2,4), Hao Fei (2), Haobo Yuan (3), Shengqiong Wu (2), Shunping Ji (1), Chen Change Loy (3), Shuicheng Yan (2); 1 Wuhan University, 2 Skywork AI, 3 S-Lab, NTU, 4 Bytedance |
| Pseudocode | No | The paper describes the model architecture and training process in detail, but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and model have been released for further research. |
| Open Datasets | Yes | During the pretraining stage, we use the LLaVA pretraining dataset [71] to perform visual-text alignment, following LLaVA. ... For pixel-level understanding and reasoning, we use the referring segmentation datasets, including refCOCO, refCOCO+ [42], refCOCOg [121], and refClef, totaling 74K data. Additionally, semantic segmentation datasets, including ADE20k [137] and COCO-stuff [7], totaling 26K data, and the grounded conversation generation dataset GranDf [87], containing 200K data, are used. |
| Dataset Splits | Yes | We evaluate OMG-LLaVA on refCOCO, refCOCO+, and refCOCOg, with the results shown in Tab. 3. ... OMG-LLaVA outperforms LISA [49] by 1.5 cIoU, 3.2 cIoU, and 4.3 cIoU on the validation sets of refCOCO, refCOCO+, and refCOCOg, respectively, while keeping the OMG decoder frozen and using only a single visual encoder. (See the cIoU sketch after the table.) |
| Hardware Specification | Yes | All training is conducted on four NVIDIA A800 GPUs with 80GB of memory. |
| Software Dependencies | Yes | We adopt xtuner codebase [22] to build our model and data pipeline. |
| Experiment Setup | Yes | During the pretraining stage, only the visual projector and text projector are trained, with an initial learning rate set to 1e-3. ... During the instruction tuning stage, the initial learning rate is set to 2e-4, with only the perception model kept frozen, and the LLM is fine-tuned using LoRA [37]. The maximum sequence length in the LLM is set to 2,048. (See the configuration sketch after the table.) |
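
The referring-segmentation results quoted in the Dataset Splits row are reported in cIoU (cumulative intersection over union), which accumulates intersections and unions over the whole split before dividing, rather than averaging per-sample IoU. Below is a minimal sketch of that metric; the function and variable names are illustrative and are not taken from the OMG-LLaVA codebase.

```python
# Minimal sketch of cIoU (cumulative IoU) for referring expression segmentation.
# Names are illustrative; this is not the OMG-LLaVA evaluation code.
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """cIoU: total intersection over total union, accumulated across all samples."""
    total_intersection = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        total_intersection += np.logical_and(pred, gt).sum()
        total_union += np.logical_or(pred, gt).sum()
    return total_intersection / max(total_union, 1)  # guard against empty unions

# Toy example: one perfect prediction and one all-background prediction.
pred = [np.ones((4, 4), dtype=bool), np.zeros((4, 4), dtype=bool)]
gt   = [np.ones((4, 4), dtype=bool), np.ones((4, 4), dtype=bool)]
print(cumulative_iou(pred, gt))  # (16 + 0) / (16 + 16) = 0.5
```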
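For orientation, here is a minimal sketch of the two-stage schedule quoted in the Experiment Setup row (projector pretraining at LR 1e-3, instruction tuning at LR 2e-4 with LoRA on the LLM, maximum sequence length 2,048), expressed with Hugging Face PEFT/Transformers rather than the authors' XTuner configs. The base model identifier, LoRA rank/alpha, target modules, batch size, and epoch count are not given in the excerpt and are assumptions for illustration.

```python
# Sketch of the quoted instruction-tuning setup using PEFT/Transformers.
# Hyperparameters marked "assumed" are not stated in the paper excerpt.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

MAX_SEQ_LEN = 2048           # maximum LLM sequence length from the quoted setup
PRETRAIN_LR = 1e-3           # stage 1: visual and text projectors only
INSTRUCTION_TUNE_LR = 2e-4   # stage 2: LLM fine-tuned with LoRA, perception model frozen

lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections (LLaMA-style naming)
    task_type="CAUSAL_LM",
)

# Hypothetical base LLM identifier; OMG-LLaVA's actual checkpoint is configured via XTuner.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)  # only the LoRA adapters remain trainable

training_args = TrainingArguments(
    output_dir="./omg_llava_sft",
    learning_rate=INSTRUCTION_TUNE_LR,
    per_device_train_batch_size=8,        # assumed; not stated in the excerpt
    num_train_epochs=1,                   # assumed
    bf16=True,
)
```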