Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. |
| Researcher Affiliation | Academia | Yiren Jian¹, Chongyang Gao², Soroush Vosoughi¹; ¹Dartmouth College, ²Northwestern University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code will be made available at https://github.com/yiren-jian/BLIText. |
| Open Datasets | Yes | For VL pre-training, we widely adapted academic setting (since academic institutions lack the resources available to industry researchers to use very large datasets) with approximately 4M image-text pairs. This set comprises the MSCOCO-80K [39], VG-100K [28], CC-3M [53], and SBU-1M [47] datasets. |
| Dataset Splits | Yes | We evaluate our fine-tuned model on the Karpathy test split of MSCOCO. Also, zero-shot transfer results on the NoCaps dataset [1] are reported. Shown in Table 2, our framework improves BLIP-2 in all metrics, with greater improvements in CIDEr compared to SPICE. |
| Hardware Specification | Yes | The P-Former is trained on a system with 3 RTX-A6000 (48GB) GPUs, using PyTorch [48]. We trained for five epochs with a linear warm-up and cosine scheduling, using a batch size of 384 (3 × 128), and AdamW as the optimizer. [...] The VL pre-training is performed on a server equipped with 8 RTX-A6000 (48GB) GPUs, using PyTorch. |
| Software Dependencies | No | The paper mentions 'PyTorch [48]' but does not specify a version number for it or any other software dependencies crucial for reproduction. |
| Experiment Setup | Yes | The P-Former is trained on a system with 3 RTX-A6000 (48GB) GPUs, using PyTorch [48]. We trained for five epochs with a linear warm-up and cosine scheduling, using a batch size of 384 (3 × 128), and AdamW as the optimizer. The initial learning rate is set to 1e-4, with a minimum learning rate of 1e-5, a warm-up learning rate of 1e-6, and 2000 warm-up steps. [...] Both the stage 1 and stage 2 training ran for 10 epochs with linear warm-up and cosine scheduling, using a batch size of 1024 (8 × 128), and AdamW as the optimizer. The weight decay is set to 0.05, the initial learning rate is 1e-4, the minimum learning rate is 1e-5, and the warm-up learning rate is 1e-6. The key distinction is that stage 1 and stage 2 incorporate 5000 and 2000 warm-up steps, respectively. We set ω1 = 10 and ω2 = 100 while training BLIP-2 OPT2.7B with our P-Former. |
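
The Experiment Setup row describes a linear warm-up followed by cosine scheduling. As a rough illustration of what such a schedule computes, here is a minimal per-step sketch using the quoted P-Former values (warm-up learning rate 1e-6, initial learning rate 1e-4, minimum learning rate 1e-5, 2000 warm-up steps). The function name `lr_at_step` and the `total_steps` argument are hypothetical, and the authors' scheduler may differ in detail, for example by applying the cosine decay per epoch rather than per step.

```python
import math


def lr_at_step(step, total_steps, warmup_steps=2000,
               init_lr=1e-4, min_lr=1e-5, warmup_lr=1e-6):
    """Learning rate at a given optimizer step (illustrative sketch only).

    Assumption: the warm-up is linear from warmup_lr to init_lr, and the
    cosine decay from init_lr to min_lr is applied per step afterwards.
    """
    if step < warmup_steps:
        # Linear warm-up from warmup_lr up to init_lr.
        return warmup_lr + (init_lr - warmup_lr) * step / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine decay from init_lr down to min_lr.
    return min_lr + 0.5 * (init_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# The quoted batch sizes are the per-GPU batch of 128 times the GPU count.
assert 3 * 128 == 384    # P-Former training, 3 GPUs
assert 8 * 128 == 1024   # VL pre-training, 8 GPUs
```

Under these assumptions the rate starts at 1e-6, reaches 1e-4 at step 2000, and ends at 1e-5 on the final step. Passing `warmup_steps=5000` would approximate the stage 1 setting quoted above.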