VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

Authors: Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on five downstream tasks show that VLMixer could surpass previous state-of-the-art unpaired VLP methods.
Researcher Affiliation | Collaboration | (1) Department of Computer Science and Engineering, Southern University of Science and Technology; (2) Department of Computer Science, The University of Hong Kong; (3) Data Platform, Tencent.
Pseudocode | Yes | Algorithm 1: Unpaired VLP via CMC. (A hedged sketch of the CMC step follows this table.)
Open Source Code | Yes | Project page: https://github.com/ttengwang/VLMixer
Open Datasets | Yes | We use a variety of datasets covering diverse visual and language patterns. Specifically, three kinds of pre-training datasets are taken into account: image-text pairs, image-only collections, and text-only corpora. The paired VL datasets contain COCO Captions (Lin et al., 2014), Visual Genome (Krishna et al., 2017b), Conceptual Captions 3M (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), Flickr30K (Plummer et al., 2015), and GQA (Hudson & Manning, 2019)...
Dataset Splits | No | The paper provides no explicit training/validation/test splits (e.g., percentages, sample counts, or specific split files). For downstream tasks it follows the "fine-tuning strategy and evaluation metrics in Zhang et al. (2021)", which may contain such details, but they are not stated directly.
Hardware Specification | Yes | The training time on the full pre-training data is around six days on 16 Tesla A100 GPUs.
Software Dependencies | No | The paper mentions using lower-case byte pair encoding (BPE) and initializing VLMixer from BERT-base, but does not give version numbers for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or CUDA.
Experiment Setup | Yes | We use a Base Transformer with 12 transformer blocks and a hidden size of 768 as the backbone. ... The replacing probability r_cmc in CMC is set to 0.5 and the context weight r_ctx is set to 0.5. The temperature ratio τ in CMCL is set to 0.1. ... We initialize VLMixer from the parameters of BERT-base, and pre-train the model on unpaired image and text data for a maximum of 300k steps. An Adam optimizer is adopted with an initial learning rate of 5e-5 and a mini-batch size of 1024. The warm-up rate is set to 10%. (The reported values are collected into a config sketch below.)
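The paper's Algorithm 1 is not reproduced in this report, so the following is only a minimal, hypothetical sketch of the core cross-modal CutMix (CMC) step as we understand it from the title and pseudocode row: words in a sentence from a text-only corpus that match a visual concept tag are replaced, with probability r_cmc, by patch/region embeddings gathered from an image-only collection, yielding a multi-modal sentence. The function name `cross_modal_cutmix`, the tag-to-region matching, and the toy embeddings are all illustrative; the sketch also omits details such as the context weight r_ctx.

```python
import random

def cross_modal_cutmix(text_tokens, tag_to_region, r_cmc=0.5, seed=None):
    """Hypothetical sketch of cross-modal CutMix (CMC): each token that
    matches a visual concept tag is, with probability r_cmc, replaced by a
    region/patch embedding taken from an image-only collection, turning a
    plain sentence into a multi-modal one."""
    rng = random.Random(seed)
    mixed = []
    for tok in text_tokens:
        if tok in tag_to_region and rng.random() < r_cmc:
            mixed.append(tag_to_region[tok])  # patch embedding stands in for the word
        else:
            mixed.append(tok)                 # ungrounded (or unreplaced) tokens stay textual
    return mixed

# Toy usage with made-up 2-d "embeddings":
tags = {"dog": [0.1, 0.9], "ball": [0.7, 0.2]}
print(cross_modal_cutmix("a dog chases a ball".split(), tags, seed=0))
```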
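For convenience, the pre-training hyper-parameters quoted in the Experiment Setup row are collected into a single config sketch. Only the values come from the paper; the key names are our own, not those of the released code.

```python
# Hyper-parameters as reported in the paper; key names are illustrative.
PRETRAIN_CONFIG = {
    "backbone": "BERT-base",   # initialization; 12 transformer blocks
    "num_layers": 12,
    "hidden_size": 768,
    "r_cmc": 0.5,              # CMC replacing probability
    "r_ctx": 0.5,              # context weight
    "tau": 0.1,                # temperature ratio in CMCL
    "max_steps": 300_000,
    "optimizer": "Adam",
    "learning_rate": 5e-5,
    "batch_size": 1024,
    "warmup_ratio": 0.10,      # 10% warm-up
}
```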
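The excerpts above do not spell out the CMCL objective itself, so the following is purely an assumption: a standard symmetric InfoNCE-style contrastive loss with the reported temperature τ = 0.1. The function name `cmcl_loss` and the pairing of text embeddings against mixed-sentence embeddings are hypothetical.

```python
import torch
import torch.nn.functional as F

def cmcl_loss(text_emb, mixed_emb, tau=0.1):
    """Assumed symmetric InfoNCE-style contrastive loss for CMCL: matched
    (text, mixed-sentence) pairs on the diagonal are positives, all other
    in-batch pairs are negatives; similarities are scaled by 1/tau."""
    text_emb = F.normalize(text_emb, dim=-1)
    mixed_emb = F.normalize(mixed_emb, dim=-1)
    logits = text_emb @ mixed_emb.t() / tau              # (B, B) cosine similarities / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with random embeddings: cmcl_loss(torch.randn(8, 768), torch.randn(8, 768))
```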