VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
Authors: Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on five downstream tasks show that VLMixer could surpass previous state-of-the-art unpaired VLP methods. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science and Engineering, Southern University of Science and Technology; (2) Department of Computer Science, The University of Hong Kong; (3) Data Platform, Tencent. |
| Pseudocode | Yes | Algorithm 1 Unpaired VLP via CMC (a hedged sketch of the CMC step appears below the table) |
| Open Source Code | Yes | Project page: https://github.com/ttengwang/VLMixer |
| Open Datasets | Yes | We use a variety of datasets covering diverse visual and language patterns. Specifically, three kinds of pre-training datasets are taken into account: image-text pairs, image-only collections, and text-only corpora. The paired VL datasets contain COCO Captions (Lin et al., 2014), Visual Genome (Krishna et al., 2017b), Conceptual Captions 3M (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), Flickr30K (Plummer et al., 2015), and GQA (Hudson & Manning, 2019)... |
| Dataset Splits | No | No explicit training/validation/test dataset splits (e.g., percentages, sample counts, or specific split files) are provided in the paper. The paper refers readers to the 'fine-tuning strategy and evaluation metrics in Zhang et al. (2021)' for downstream tasks, which may contain such details, but they are not stated directly. |
| Hardware Specification | Yes | The training time on the full pre-training data is around six days on 16 Tesla A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'lower-case byte pair encoding (BPE)' and initializes VLMixer from 'BERT-base', but does not provide specific version numbers for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or CUDA. |
| Experiment Setup | Yes | We use a Base Transformer with 12 transformer blocks and a hidden size of 768 as the backbone. ... The replacing probability r_cmc in CMC is set to 0.5 and the context weight r_ctx is set to 0.5. The temperature ratio τ in CMCL is set to 0.1. ... We initialize VLMixer from the parameters of BERT-base, and pre-train the model on unpaired image and text data for a maximum of 300k steps. An Adam optimizer is adopted with an initial learning rate of 5e-5 and a mini-batch size of 1024. The warm-up rate is set to 10%. (A hedged sketch of this optimization schedule appears below the table.) |
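The pseudocode row points to Algorithm 1 (Unpaired VLP via CMC). As a rough illustration of the cross-modal CutMix step, the minimal Python sketch below replaces grounded words in a sentence with visual patch embeddings at probability r_cmc (0.5 in the paper). The function name, the `grounded_patches` lookup, and the uniform sampling of candidate patches are assumptions made for illustration, not the paper's released implementation.

```python
import random
from typing import Dict, List, Union

def cross_modal_cutmix(
    tokens: List[str],
    grounded_patches: Dict[str, List[list]],
    r_cmc: float = 0.5,
) -> List[Union[str, list]]:
    """Hedged sketch of cross-modal CutMix (CMC): build a multimodal
    sentence by swapping grounded words for visual patch embeddings.
    `grounded_patches` maps a word to candidate patch embeddings drawn
    from images sharing that concept; this lookup is an assumption."""
    mixed: List[Union[str, list]] = []
    for tok in tokens:
        candidates = grounded_patches.get(tok)
        # Each grounded token is replaced with probability r_cmc (0.5 in the paper).
        if candidates and random.random() < r_cmc:
            mixed.append(random.choice(candidates))  # patch embedding stands in for the word
        else:
            mixed.append(tok)
    return mixed

# Toy usage: "dog" and "grass" are grounded to fake 2-d patch embeddings.
sentence = ["a", "dog", "runs", "on", "grass"]
patches = {"dog": [[0.1, 0.2], [0.3, 0.4]], "grass": [[0.5, 0.6]]}
print(cross_modal_cutmix(sentence, patches))
```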
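The experiment-setup row fully specifies the optimization recipe. The PyTorch sketch below wires those numbers together; the step count, peak learning rate, batch size, and 10% warm-up come from the quoted text, while the small stand-in backbone and the linear post-warm-up decay are assumptions (the paper states only the warm-up rate, not the decay schedule).

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Values quoted in the Experiment Setup row above.
TOTAL_STEPS = 300_000                    # maximum pre-training steps
WARMUP_STEPS = int(0.10 * TOTAL_STEPS)   # 10% warm-up rate
PEAK_LR = 5e-5
BATCH_SIZE = 1024

# Stand-in for the BERT-base-sized backbone: 12 blocks, hidden size 768.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(
        d_model=768, nhead=12, dim_feedforward=3072, batch_first=True
    ),
    num_layers=12,
)

optimizer = Adam(model.parameters(), lr=PEAK_LR)

def lr_lambda(step: int) -> float:
    # Linear warm-up to the peak rate; the linear decay afterwards is an
    # assumption, since the paper only reports the warm-up rate.
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)
```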