Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal that our framework significantly enhances the performance of a robust image-to-text baseline (BLIP-2), and effectively narrows the performance gap between models trained with either 4M or 129M image-text pairs. |
| Researcher Affiliation | Academia | Yiren Jian (Dartmouth College), Chongyang Gao (Northwestern University), Soroush Vosoughi (Dartmouth College) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code will be made available at https://github.com/yiren-jian/BLIText. |
| Open Datasets | Yes | For VL pre-training, we adopt the widely used academic setting (since academic institutions lack the resources available to industry researchers to use very large datasets) with approximately 4M image-text pairs. This set comprises the MSCOCO-80K [39], VG-100K [28], CC-3M [53], and SBU-1M [47] datasets. |
| Dataset Splits | Yes | We evaluate our fine-tuned model on the Karpathy test split of MSCOCO. Also, zero-shot transfer results on the NoCaps dataset [1] are reported. As shown in Table 2, our framework improves BLIP-2 in all metrics, with greater improvements in CIDEr compared to SPICE. |
| Hardware Specification | Yes | The P-Former is trained on a system with 3 RTX-A6000 (48GB) GPUs, using PyTorch [48]. We trained for five epochs with a linear warm-up and cosine scheduling, using a batch size of 384 (3 × 128), and AdamW as the optimizer. [...] The VL pre-training is performed on a server equipped with 8 RTX-A6000 (48GB) GPUs, using PyTorch. |
| Software Dependencies | No | The paper mentions 'PyTorch [48]' but does not specify a version number for it or any other software dependencies crucial for reproduction. |
| Experiment Setup | Yes | The P-Former is trained on a system with 3 RTX-A6000 (48GB) GPUs, using PyTorch [48]. We trained for five epochs with a linear warm-up and cosine scheduling, using a batch size of 384 (3 × 128), and AdamW as the optimizer. The initial learning rate is set to 1e-4, with a minimum learning rate of 1e-5, a warm-up learning rate of 1e-6, and 2000 warm-up steps. [...] Both the stage 1 and stage 2 training ran for 10 epochs with linear warm-up and cosine scheduling, using a batch size of 1024 (8 × 128), and AdamW as the optimizer. The weight decay is set to 0.05, the initial learning rate is 1e-4, the minimum learning rate is 1e-5, and the warm-up learning rate is 1e-6. The key distinction is that stage 1 and stage 2 incorporate 5000 and 2000 warm-up steps, respectively. We set ω1 = 10 and ω2 = 100 while training BLIP-2 OPT2.7B with our P-Former. |
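
To make the reported optimization schedule concrete, the following is a minimal PyTorch sketch of an AdamW optimizer with linear warm-up followed by cosine decay, using the hyperparameters quoted in the Experiment Setup row (initial LR 1e-4, minimum LR 1e-5, warm-up LR 1e-6, weight decay 0.05, 2000 warm-up steps). This is an illustrative reconstruction under those assumptions, not the authors' implementation; `model` and `total_steps` are placeholders, and the released code at https://github.com/yiren-jian/BLIText should be treated as authoritative.

```python
# Sketch of the reported schedule: AdamW + linear warm-up + cosine decay.
# Hyperparameter values are taken from the paper excerpt above; `model` and
# `total_steps` are hypothetical placeholders, not values from the paper.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model,
                                  init_lr=1e-4,
                                  min_lr=1e-5,
                                  warmup_lr=1e-6,
                                  warmup_steps=2000,
                                  total_steps=100_000,
                                  weight_decay=0.05):
    """AdamW with a linear warm-up from `warmup_lr` to `init_lr`,
    then cosine decay from `init_lr` down to `min_lr`."""
    optimizer = AdamW(model.parameters(), lr=init_lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_steps:
            # Linear warm-up: warmup_lr -> init_lr
            lr = warmup_lr + (init_lr - warmup_lr) * step / max(1, warmup_steps)
        else:
            # Cosine decay: init_lr -> min_lr
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lr = min_lr + 0.5 * (init_lr - min_lr) * (1 + math.cos(math.pi * progress))
        return lr / init_lr  # LambdaLR scales the optimizer's base lr by this factor

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Per the excerpt, the two VL pre-training stages would differ from this sketch only in their warm-up length (5000 steps for stage 1 versus 2000 for stage 2), with the remaining settings shared.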