Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models
Authors: Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experimental results on a series of vision-centric and VQA benchmarks indicate that our Lumen model achieves or surpasses the performance of existing LMM-based approaches in a range of vision-centric tasks while maintaining general visual understanding and instruction following capabilities. |
| Researcher Affiliation | Collaboration | Yang Jiao 1,2,3, Shaoxiang Chen 3, Zequn Jie 3, Jingjing Chen 1,2, Lin Ma 3, Yu-Gang Jiang 1,2. 1 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; 2 Shanghai Collaborative Innovation Center on Intelligent Visual Computing; 3 Meituan |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 5 'Detailed designs of V-L Dense Aligner'), but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | Due to the privacy policy of the institution the authors collaborated with, the code should be published only after obtaining authorization. |
| Open Datasets | Yes | Our training data consists of datasets from the following different tasks. (1) For object detection, we use MSCOCO [59], Objects365 [60] and Open Images [61]. (2) For visual grounding, we use RefCOCO, RefCOCO+ and RefCOCOg [62]. (3) For pose estimation, we use MSCOCO keypoints [59] and AIC [63]. (4) For visual question answering, we employ a subset of the ShareGPT4V dataset [33] with 665K samples. |
| Dataset Splits | Yes | Evaluation Metrics. We adopt evaluation metrics commonly used within each field of task. For object detection and instance segmentation, we use mAP based on box IoU and mask IoU, respectively. For pose estimation, we use mAP based on OKS (object keypoint similarity). For visual question answering, we comply with the evaluation protocol of each individual benchmark. ... COCO val set is used for evaluation. ... RefCOCOg val set for comparison... MMBench dev, SEED-Bench image, MME test, MMMU val and MathVista mini sets for evaluation. (The standard OKS computation is sketched below the table.) |
| Hardware Specification | Yes | We set the batch size to 160 and train the first step for 50,000 steps and the second step for 10,000 steps on 8 NVIDIA 80G A100 GPUs. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions) are explicitly listed in the paper. |
| Experiment Setup | Yes | For the first task-agnostic matching stage, we utilize pre-trained CLIP ViT-L/14-336 [55] and LLaVA-7B-v1.5 [4] as our vision encoder and large multimodal model, respectively. ... For the task-agnostic stage, our training comprises two phases. ... In Phase 1, we mix the object detection, visual grounding and pose estimation data with sampling rates of 0.69, 0.23, 0.08, respectively, for balanced data sampling. ... In Phase 2, we mix the visual question-answering, object detection, visual grounding and pose estimation data with sample rates of 0.67, 0.23, 0.07, 0.03, respectively. We set the batch size to 160 and train the first step for 50,000 steps and the second step for 10,000 steps on 8 NVIDIA 80G A100 GPUs. The loss function balance terms λh and λt are both set to 1. For each phase, we use AdamW as the optimizer with an initial learning rate of 3×10⁻⁴ and weight decay of 0. (A hedged sketch of this configuration follows the table.) |
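
The pose-estimation metric quoted above, mAP based on OKS (object keypoint similarity), follows the standard COCO keypoint definition: OKS = Σᵢ exp(−dᵢ² / (2 s² kᵢ²)) · 1[vᵢ > 0] / Σᵢ 1[vᵢ > 0]. The snippet below is a minimal sketch of that standard computation, not code from the paper; the `KAPPA` constants, the array layouts, and the `oks` function name are illustrative assumptions.

```python
import math

# Per-keypoint falloff constants k_i; these values are illustrative
# placeholders, not the exact constants used by the COCO evaluator.
KAPPA = [0.026, 0.025, 0.025, 0.035, 0.035]

def oks(pred, gt, visibility, area, kappa=KAPPA):
    """Object keypoint similarity between one predicted and one ground-truth pose.

    pred, gt   : lists of (x, y) keypoint coordinates
    visibility : ground-truth visibility flags (v_i > 0 means the keypoint is labeled)
    area       : ground-truth object area, acting as the squared scale term s^2
    """
    num, den = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visibility, kappa):
        if v <= 0:                      # unlabeled keypoints are ignored
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        num += math.exp(-d2 / (2.0 * area * k ** 2))
        den += 1
    return num / den if den else 0.0

# Toy usage: small localization errors keep each per-keypoint term near 1,
# so the averaged OKS stays high.
gt = [(10.0, 10.0), (20.0, 12.0), (30.0, 14.0), (40.0, 30.0), (50.0, 32.0)]
pred = [(10.5, 10.2), (20.1, 12.3), (29.8, 14.1), (40.4, 29.7), (50.2, 32.1)]
print(round(oks(pred, gt, visibility=[2] * 5, area=900.0), 3))
```

COCO-style keypoint mAP then averages precision over OKS thresholds from 0.50 to 0.95 in steps of 0.05, mirroring the box-IoU thresholds used for detection mAP.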
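
The Experiment Setup row records a concrete training configuration: phase-wise dataset sampling rates (0.69/0.23/0.08, then 0.67/0.23/0.07/0.03), batch size 160, 50,000 + 10,000 steps, AdamW with learning rate 3×10⁻⁴ and weight decay 0, and loss balance terms λh = λt = 1. The sketch below shows one way such a phase-wise task mixture could be sampled; the `PHASES`/`TRAIN_CFG` dictionaries and the `sample_task` helper are hypothetical scaffolding, not the authors' released code.

```python
import random

# Hyperparameters as reported in the paper; the dictionary layout itself is
# an illustrative assumption.
PHASES = {
    "phase1": {"detection": 0.69, "grounding": 0.23, "pose": 0.08},
    "phase2": {"vqa": 0.67, "detection": 0.23, "grounding": 0.07, "pose": 0.03},
}
TRAIN_CFG = {
    "batch_size": 160,
    "steps": {"phase1": 50_000, "phase2": 10_000},
    "optimizer": {"name": "AdamW", "lr": 3e-4, "weight_decay": 0.0},
    "loss_weights": {"lambda_h": 1.0, "lambda_t": 1.0},
}

def sample_task(phase: str) -> str:
    """Pick the source task for one training example according to the
    phase-specific mixing rates (the paper's balanced data sampling)."""
    tasks, weights = zip(*PHASES[phase].items())
    return random.choices(tasks, weights=weights, k=1)[0]

# Roughly 69% of phase-1 draws should come from object detection.
draws = [sample_task("phase1") for _ in range(10_000)]
print(round(draws.count("detection") / len(draws), 2))
```

Sampling the task per example reproduces the reported mixing rates in expectation without pre-partitioning the combined corpus, which is a common way to implement this kind of balanced multi-task data loading.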