Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ParGo: Bridging Vision-Language with Partial and Global Views
Authors: An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, Wei-Shi Zheng
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of our ParGo, highlighting its superiority in aligning vision and language modalities. Compared to conventional Q-Former projector, our ParGo achieves an improvement of 259.96 in MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, China; 2ByteDance, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2 illustrates the pipeline and attention masks), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models: https://github.com/bytedance/ParGo |
| Open Datasets | Yes | For pre-training, we use existing coarse-captioned data, including CC-3M, SBU Caption-1M, LAION-400M, and the constructed detail-captioned data ParGoCap-1M-PT. For the supervised fine-tuning stage, [...] we employ four types of tasks, including: 1) Open-ended VQA, i.e. VQAv2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), OCRVQA (Mishra et al. 2019), VSR (Liu, Emerson, and Collier 2023). 2) Multiple-Choice VQA, including ScienceQA (Lu et al. 2022), A-OKVQA (Schwenk et al. 2022). 3) Referring Expression Comprehension (REC), which includes RefCOCO (Kazemzadeh et al. 2014) and VG (Krishna et al. 2017). 4) Instruction tuning data, LLaVA-150k (Liu et al. 2023b). |
| Dataset Splits | Yes | For each dataset, we use the same templates as previous works (Liu et al. 2023a; Cha et al. 2023; Shan et al. 2024). |
| Hardware Specification | Yes | For all experiments, 32 A100 80GB GPUs are used. |
| Software Dependencies | Yes | We employ deepspeed zero-2 (Rajbhandari et al. 2020) and flash-attention v2 (Dao 2023) for all experiments |
| Experiment Setup | Yes | Model Configuration. We use pre-trained EVA-02-CLIP-L/14 (Sun et al. 2023) with 336 resolution as the visual encoder. For the Large Language model, we employ the 7B Vicuna (Chiang et al. 2023) for a fair comparison. For the Partial-Global projector, six layers are utilized for all experiments if not otherwise specified. The number of partial and global tokens is 288 and 16, respectively, resulting in a total of 304 tokens. Training. The proposed ParGo is trained using a two-stage pipeline, i.e., coarse-detailed captioned pre-training and supervised fine-tuning. [...] we use parameter-efficient fine-tuning, i.e., Low-Rank Adaptation (LoRA) (Hu et al. 2021), and the rank is set to 256. [...] The pre-training stage takes approximately 24 hours, while the supervised fine-tuning tasks require around 12 hours. |
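The Experiment Setup row above reports 288 partial tokens and 16 global tokens (304 total) for the Partial-Global projector, with partial tokens attending to subsets of the visual features and global tokens attending to all of them (the paper illustrates these attention masks in Figure 2). A minimal sketch of such masks, assuming a contiguous, equal-sized partition of the visual features — the paper's exact grouping may differ. The 576 visual features follow from the stated encoder: EVA-02-CLIP-L/14 at 336 resolution yields (336/14)² = 576 patches.

```python
import numpy as np

def partial_global_masks(n_visual=576, n_partial=288, n_global=16):
    """Build boolean cross-attention masks for a Partial-Global projector.

    Illustrative assumption: visual features are split into contiguous,
    equal-sized groups, one group per partial token; global tokens
    attend to every visual feature.
    """
    assert n_visual % n_partial == 0
    group = n_visual // n_partial  # visual features per partial token
    partial_mask = np.zeros((n_partial, n_visual), dtype=bool)
    for i in range(n_partial):
        partial_mask[i, i * group:(i + 1) * group] = True
    global_mask = np.ones((n_global, n_visual), dtype=bool)
    # Stack into one mask for the 304 query tokens reported in the paper
    return np.concatenate([partial_mask, global_mask], axis=0)

mask = partial_global_masks()
print(mask.shape)  # (304, 576)
```

Each partial token here covers a disjoint window of 2 visual features, so together the 288 partial tokens tile all 576 features exactly once, while the 16 global tokens see everything.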
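The Software Dependencies row states that all experiments use DeepSpeed ZeRO-2. A minimal configuration dict of the kind DeepSpeed accepts, sketched for context; the batch size, gradient accumulation, and precision values are placeholders, since the quoted text does not report them.

```python
# Minimal DeepSpeed configuration enabling ZeRO stage 2 (shards optimizer
# states and gradients across GPUs). Numeric values are placeholders, not
# taken from the paper; bf16 is an assumption for A100 hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "gradient_accumulation_steps": 1,      # placeholder
    "bf16": {"enabled": True},             # assumption
    "zero_optimization": {
        "stage": 2,                        # ZeRO-2
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
print(ds_config["zero_optimization"]["stage"])  # 2
```

Such a dict would typically be passed to `deepspeed.initialize` alongside the model; flash-attention v2 is enabled separately in the model's attention implementation.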