Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
ParGo: Bridging Vision-Language with Partial and Global Views
Authors: An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, Wei-Shi Zheng
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of our ParGo, highlighting its superiority in aligning vision and language modalities. Compared to conventional Q-Former projector, our ParGo achieves an improvement of 259.96 in MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, China; 2ByteDance, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and figures (Figure 2 illustrates the pipeline and attention masks), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models: https://github.com/bytedance/ParGo |
| Open Datasets | Yes | For pre-training, we use existing coarse-captioned data, including CC-3M, SBU Caption-1M, LAION-400M, and the constructed detail-captioned data ParGoCap-1M-PT. For the supervised fine-tuning stage, [...] we employ four types of tasks, including: 1) Open-ended VQA, i.e. VQAv2 (Goyal et al. 2017), GQA (Hudson and Manning 2019), OCRVQA (Mishra et al. 2019), VSR (Liu, Emerson, and Collier 2023). 2) Multiple-Choice VQA, including ScienceQA (Lu et al. 2022), A-OKVQA (Schwenk et al. 2022). 3) Referring Expression Comprehension (REC), which includes RefCOCO (Kazemzadeh et al. 2014) and VG (Krishna et al. 2017). 4) Instruction tuning data, LLaVA-150k (Liu et al. 2023b). |
| Dataset Splits | Yes | For each dataset, we use the same templates as previous works (Liu et al. 2023a; Cha et al. 2023; Shan et al. 2024). |
| Hardware Specification | Yes | For all experiments, 32 A100 80GB GPUs are used. |
| Software Dependencies | Yes | We employ deepspeed zero-2 (Rajbhandari et al. 2020) and flash-attention v2 (Dao 2023) for all experiments |
| Experiment Setup | Yes | Model Configuration. We use pre-trained EVA-02-CLIP-L/14 (Sun et al. 2023) with 336 resolution as the visual encoder. For the Large Language model, we employ the 7B Vicuna (Chiang et al. 2023) for a fair comparison. For the Partial-Global projector, six layers are utilized for all experiments if not otherwise specified. The number of partial and global tokens is 288 and 16, respectively, resulting in a total of 304 tokens. Training. The proposed ParGo is trained using a two-stage pipeline, i.e., coarse-detailed captioned pre-training and supervised fine-tuning. [...] we use parameter-efficient fine-tuning, i.e., Low-Rank Adaptation (LoRA) (Hu et al. 2021), and the rank is set to 256. [...] The pre-training stage takes approximately 24 hours, while the supervised fine-tuning tasks require around 12 hours. |
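The Experiment Setup row above reports 288 partial tokens and 16 global tokens (304 total) for the Partial-Global projector, with partial tokens attending to subsets of the visual features and global tokens attending to all of them (the paper illustrates these attention masks in Figure 2). A minimal sketch of such masks, assuming a contiguous, equal-sized partition of the visual features — the paper's exact grouping may differ. The 576 visual features follow from the stated encoder: EVA-02-CLIP-L/14 at 336 resolution yields (336/14)² = 576 patches.

```python
import numpy as np

def partial_global_masks(n_visual=576, n_partial=288, n_global=16):
    """Build boolean cross-attention masks for a Partial-Global projector.

    Illustrative assumption: visual features are split into contiguous,
    equal-sized groups, one group per partial token; global tokens
    attend to every visual feature.
    """
    assert n_visual % n_partial == 0
    group = n_visual // n_partial  # visual features per partial token
    partial_mask = np.zeros((n_partial, n_visual), dtype=bool)
    for i in range(n_partial):
        partial_mask[i, i * group:(i + 1) * group] = True
    global_mask = np.ones((n_global, n_visual), dtype=bool)
    # Stack into one mask for the 304 query tokens reported in the paper
    return np.concatenate([partial_mask, global_mask], axis=0)

mask = partial_global_masks()
print(mask.shape)  # (304, 576)
```

Each partial token here covers a disjoint window of 2 visual features, so together the 288 partial tokens tile all 576 features exactly once, while the 16 global tokens see everything.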
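The Software Dependencies row states that all experiments use DeepSpeed ZeRO-2. A minimal configuration dict of the kind DeepSpeed accepts, sketched for context; the batch size, gradient accumulation, and precision values are placeholders, since the quoted text does not report them.

```python
# Minimal DeepSpeed configuration enabling ZeRO stage 2 (shards optimizer
# states and gradients across GPUs). Numeric values are placeholders, not
# taken from the paper; bf16 is an assumption for A100 hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "gradient_accumulation_steps": 1,      # placeholder
    "bf16": {"enabled": True},             # assumption
    "zero_optimization": {
        "stage": 2,                        # ZeRO-2
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
print(ds_config["zero_optimization"]["stage"])  # 2
```

Such a dict would typically be passed to `deepspeed.initialize` alongside the model; flash-attention v2 is enabled separately in the model's attention implementation.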