Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs.
Researcher Affiliation | Academia | Yiren Jian¹, Chongyang Gao², Soroush Vosoughi¹ (¹Dartmouth College, ²Northwestern University)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The code will be made available at https://github.com/yiren-jian/BLIText.
Open Datasets | Yes | For VL pre-training, we adopted the widely used academic setting (since academic institutions lack the resources available to industry researchers to use very large datasets) with approximately 4M image-text pairs. This set comprises the MSCOCO-80K [39], VG-100K [28], CC-3M [53], and SBU-1M [47] datasets.
Dataset Splits | Yes | We evaluate our fine-tuned model on the Karpathy test split of MSCOCO. Zero-shot transfer results on the NoCaps dataset [1] are also reported. As shown in Table 2, our framework improves BLIP-2 on all metrics, with greater improvements in CIDEr than in SPICE.
Hardware Specification | Yes | The P-Former is trained on a system with 3 RTX-A6000 (48GB) GPUs, using PyTorch [48]. We trained for five epochs with a linear warm-up and cosine scheduling, using a batch size of 384 (3 × 128), and AdamW as the optimizer. [...] The VL pre-training is performed on a server equipped with 8 RTX-A6000 (48GB) GPUs, using PyTorch.
Software Dependencies | No | The paper mentions PyTorch [48] but does not specify a version number for it or for any other software dependencies crucial for reproduction.
Experiment Setup | Yes | The P-Former is trained on a system with 3 RTX-A6000 (48GB) GPUs, using PyTorch [48]. We trained for five epochs with a linear warm-up and cosine scheduling, using a batch size of 384 (3 × 128), and AdamW as the optimizer. The initial learning rate is set to 1e-4, with a minimum learning rate of 1e-5, a warm-up learning rate of 1e-6, and 2000 warm-up steps. [...] Both the stage 1 and stage 2 training ran for 10 epochs with linear warm-up and cosine scheduling, using a batch size of 1024 (8 × 128), and AdamW as the optimizer. The weight decay is set to 0.05, the initial learning rate is 1e-4, the minimum learning rate is 1e-5, and the warm-up learning rate is 1e-6. The key distinction is that stage 1 and stage 2 incorporate 5000 and 2000 warm-up steps, respectively. We set ω1 = 10 and ω2 = 100 while training BLIP-2 OPT-2.7B with our P-Former. (See the configuration sketch below the table.)
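
To make the quoted training recipe concrete, the sketch below shows one way the optimizer and learning-rate schedule described in the Experiment Setup row could be set up in PyTorch. This is a minimal reconstruction under assumptions, not the authors' released code: only the hyper-parameter values (AdamW, linear warm-up followed by cosine decay, the initial/minimum/warm-up learning rates, warm-up steps, weight decay, epochs, and effective batch size) come from the quoted text; the helper name `build_optimizer_and_scheduler`, the placeholder model, and the exact scheduler implementation are illustrative.

```python
# Minimal sketch (not the authors' code) of the quoted optimizer / LR schedule:
# AdamW with a linear warm-up followed by cosine decay. All names other than
# the hyper-parameter values are illustrative assumptions.
import math
import torch


def build_optimizer_and_scheduler(model,
                                  total_steps,
                                  init_lr=1e-4,       # initial learning rate (quoted)
                                  min_lr=1e-5,        # minimum learning rate (quoted)
                                  warmup_lr=1e-6,     # warm-up learning rate (quoted)
                                  warmup_steps=2000,  # 2000 for P-Former / stage 2, 5000 for stage 1
                                  weight_decay=0.05): # quoted for VL pre-training; assumed elsewhere
    """AdamW with linear warm-up to init_lr, then cosine decay down to min_lr."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=init_lr,
                                  weight_decay=weight_decay)

    def lr_factor(step):
        if step < warmup_steps:
            # Linear warm-up from warmup_lr to init_lr.
            lr = warmup_lr + (init_lr - warmup_lr) * step / max(1, warmup_steps)
        else:
            # Cosine decay from init_lr to min_lr over the remaining steps.
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lr = min_lr + 0.5 * (init_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
        return lr / init_lr  # LambdaLR multiplies the base lr by this factor

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)
    return optimizer, scheduler


# Illustrative usage for the quoted P-Former run: 5 epochs at an effective
# batch size of 384 (3 GPUs x 128). `p_former` and `num_samples` are
# placeholders; the paper's excerpt does not provide this code.
# steps_per_epoch = num_samples // 384
# optimizer, scheduler = build_optimizer_and_scheduler(
#     p_former, total_steps=5 * steps_per_epoch)
```

Under this reading, `scheduler.step()` would be called once per optimization step so that the warm-up counts quoted above (2000 or 5000 steps) are interpreted as gradient steps rather than epochs.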