CoBIT: A Contrastive Bi-directional Image-Text Generation Model

Authors: Haoxuan You, Mandy Guo, Zhecan Wang, Kai-Wei Chang, Jason Michael Baldridge, Jiahui Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments demonstrate CoBIT's superior performance and, more importantly, verify for the first time the compatibility of the three objectives. Benefiting from the compatible objectives, CoBIT subsumes strong zero-shot and transferable capacities of unimodal visual understanding, image-text matching, image-text understanding, and text-to-image generation. For example, CoBIT achieves 82.7% accuracy in zero-shot ImageNet classification, 9.37 FID in zero-shot text-to-image generation, and a 44.8 CIDEr score in zero-shot image-to-text captioning. After fine-tuning, CoBIT further achieves 86.44% linear-probing accuracy on ImageNet, 4.62 FID on text-to-image generation, and a 78.3 VQA score."
Researcher Affiliation | Collaboration | Columbia University, Google Research, UCLA; haoxuan.you@cs.columbia.edu, {xyguo,jasonbaldridge,jiahuiyu}@google.com
Pseudocode | No | The paper describes the model architecture and pre-training details but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository for the described methodology.
Open Datasets | Yes | "For contrastive loss and I2T loss, we use a mixture of ALIGN dataset (Jia et al., 2021), and JFT-4B dataset (Zhai et al., 2022a)... Instead, we replace JFT with WebLI dataset (Chen et al., 2022), and mix it with ALIGN for T2I generation loss. ... In the end, we obtain 1.1B pairs from ALIGN dataset, 162M pairs from WebLI dataset, and 4B pairs from JFT-4B dataset."
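
The per-objective data mixture quoted above can be summarized in a short sketch. This is a minimal illustration, not the paper's pipeline: the loaders below are hypothetical stand-ins (none of the three datasets is publicly downloadable in this form), and the uniform mixing weights are an assumption; only the per-objective batch sizes come from the paper.

```python
import tensorflow as tf

# Hypothetical stand-ins for the ALIGN, WebLI, and JFT-4B loaders;
# the real input pipelines are not public.
align_ds = tf.data.Dataset.range(1_000)
webli_ds = tf.data.Dataset.range(1_000)
jft_ds = tf.data.Dataset.range(1_000)

def mixture(datasets, batch_size):
    # Uniform mixing weights are an assumption; the paper does not state them.
    mixed = tf.data.Dataset.sample_from_datasets([d.repeat() for d in datasets])
    return mixed.batch(batch_size, drop_remainder=True)

# Per step: 1,024 pairs (ALIGN + WebLI) for the T2I loss, and
# 30,720 pairs (ALIGN + JFT) for the contrastive and I2T losses.
t2i_batches = mixture([align_ds, webli_ds], batch_size=1_024)
con_i2t_batches = mixture([align_ds, jft_ds], batch_size=30_720)
```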
Dataset Splits | Yes | "We follow the standard evaluation protocols as in CLIP, ALIGN, etc (details in Appendix 6.4.2)." (zero-shot image classification) and "We fine-tune all parameters of CoBIT and evaluate it on the val/test set." (image-text understanding)
Hardware Specification | Yes | "CoBIT-Base/CoBIT-Large takes around 12 days on 256/512 Cloud TPUv4 chips."
Software Dependencies | No | "CoBIT is implemented using Pax (Team, 2023), a Jax-based framework." The framework is named, but specific version numbers for Pax or Jax are not provided, which reproducibility requires.
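
For anyone re-running the model, recording the installed framework versions closes this gap. A minimal sketch, assuming the public paxml/praxis packages are the released form of Pax (the paper does not say):

```python
from importlib.metadata import PackageNotFoundError, version

# Print the versions of the Jax/Pax stack actually in the environment;
# `paxml` and `praxis` are assumed public package names for Pax.
for pkg in ("jax", "jaxlib", "paxml", "praxis"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```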
Experiment Setup | Yes | "Within each batch, for optimizing T2I loss, we sample 1,024 image-text pairs from a mixture of ALIGN and WebLI datasets, and for optimizing contrastive and I2T losses, we sample 30,720 image-text pairs from a mixture of ALIGN and JFT datasets. In total, the batch size is 31,744. We use the Adafactor (Shazeer & Stern, 2018) optimizer with β1 = 0.9, β2 = 0.96 and a weight decay of 0.045. As for the learning rate schedule, we warm it up to 4.5e-5 in the first 5,000 steps and then use an exponential decay starting from the step of 85,000. In total, models are pre-trained for 1M steps... In default, we set λT2I : λI2T : λCon = 1 : 0.2 : 0.1." (Sections 3.3 and 4.1), plus Table 7 detailing fine-tuning hyperparameters.
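
The quoted optimizer and schedule translate almost directly into optax. The sketch below is an assumption about the wiring (the paper uses Pax internals, not optax directly); the warmup target, decay start, Adafactor betas, weight decay, and loss weights come from the paper, while the exponential decay rate and horizon are unstated and therefore purely illustrative.

```python
import optax

# Warm up to 4.5e-5 over the first 5,000 steps, hold until step 85,000,
# then decay exponentially (decay_rate and transition_steps are assumptions;
# the paper does not report them).
lr_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, 4.5e-5, transition_steps=5_000),
        optax.constant_schedule(4.5e-5),
        optax.exponential_decay(
            init_value=4.5e-5, transition_steps=100_000, decay_rate=0.5),
    ],
    boundaries=[5_000, 85_000],
)

# Adafactor with beta1 = 0.9 (momentum), beta2 = 0.96 (second-moment decay),
# and weight decay 0.045, as stated in the paper.
optimizer = optax.adafactor(
    learning_rate=lr_schedule,
    momentum=0.9,
    decay_rate=0.96,
    weight_decay_rate=0.045,
)

# Default loss weighting: lambda_T2I : lambda_I2T : lambda_Con = 1 : 0.2 : 0.1.
def total_loss(loss_t2i, loss_i2t, loss_con):
    return 1.0 * loss_t2i + 0.2 * loss_i2t + 0.1 * loss_con
```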