Masked Vision and Language Modeling for Multi-modal Representation Learning
Authors: Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on various V+L tasks show that the proposed method, along with common V+L alignment losses, achieves state-of-the-art performance in the regime of millions of pre-training data. |
| Researcher Affiliation | Industry | Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, Stefano Soatto AWS AI Labs {gukyeong,zhaoweic,soattos}@amazon.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are provided in the paper. |
| Open Source Code | No | No explicit statement about releasing source code for the described methodology or a direct link to a code repository is found in the paper. |
| Open Datasets | Yes | We use the union of four datasets for pre-training... These datasets are Conceptual Captions (CC) (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2017), and COCO Captions (Lin et al., 2014). |
| Dataset Splits | Yes | To be specific, we follow data splits proposed in (Karpathy & Fei-Fei, 2015) and an average recall over image and text retrieval is used to find the best model in the validation set. |
| Hardware Specification | Yes | A batch size of 512 is used with 16 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | We used the ImageNet-pretrained ViT (vit_base_patch16_224) from (Wightman, 2019) and the pre-trained RoBERTa (roberta-base) from Hugging Face (Wolf et al., 2020). |
| Experiment Setup | Yes | We pre-train the model for 50 epochs when the 4M dataset is used and 30 epochs for all other experiments. A batch size of 512 is used with 16 NVIDIA Tesla V100 GPUs. All parameters are optimized using AdamW (Loshchilov & Hutter, 2017) with a weight decay of 0.05. Following (Xie et al., 2021), we use the image masking ratio of 60%. While 15% masking ratio is used for text in language models (Devlin et al., 2018; Liu et al., 2019), we use 30% since the paired image can provide additional information for text reconstruction. During pre-training, the learning rate is warmed up to 3 × 10⁻⁴ in the first 5 epochs and decayed to 3 × 10⁻⁵ using a cosine scheduler. The learning rates for the image encoder and the text encoder are set to 10⁻⁵, which is less than that of the cross-modality encoders. An image size of 224 × 224 and RandAugment (Cubuk et al., 2020) are used. |
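
The Software Dependencies row above names the pre-trained uni-modal backbones but no released code. Below is a minimal sketch of how those backbones could be loaded with the cited libraries (timm for the ImageNet-pretrained ViT, Hugging Face Transformers for RoBERTa); how they are wired into the cross-modality encoders is not specified in the quote and is left out here.

```python
# Hedged sketch: load the pre-trained backbones named in the paper.
# The surrounding cross-modal model is not reproduced here.
import timm
from transformers import RobertaModel, RobertaTokenizer

# ImageNet-pretrained ViT-B/16 from the timm library (Wightman, 2019)
image_encoder = timm.create_model("vit_base_patch16_224", pretrained=True)

# Pre-trained RoBERTa-base from Hugging Face (Wolf et al., 2020)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base")
```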
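The Experiment Setup row can be condensed into a small configuration sketch. The hyper-parameter values (AdamW with weight decay 0.05, peak learning rate 3 × 10⁻⁴ warmed up over 5 epochs and cosine-decayed to 3 × 10⁻⁵, 10⁻⁵ for the uni-modal encoders, 60% image and 30% text masking, 50 or 30 epochs) are taken from the quote; the module names and the exact shape of the warmup/cosine helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pre-training optimization setup, assuming PyTorch
# modules for the image, text, and cross-modality encoders are given.
import math
import torch
from torch import nn

def build_pretraining_optimizer(
    image_encoder: nn.Module,
    text_encoder: nn.Module,
    cross_modal_encoders: nn.Module,
    epochs: int = 50,          # 50 epochs for the 4M set, 30 otherwise
    warmup_epochs: int = 5,
    peak_lr: float = 3e-4,     # warmed up to 3e-4 ...
    min_lr: float = 3e-5,      # ... and decayed to 3e-5 with a cosine schedule
    encoder_lr: float = 1e-5,  # smaller LR for the uni-modal encoders
    weight_decay: float = 0.05,
):
    """AdamW with per-module learning rates and a warmup + cosine schedule."""
    param_groups = [
        {"params": image_encoder.parameters(), "lr": encoder_lr},
        {"params": text_encoder.parameters(), "lr": encoder_lr},
        {"params": cross_modal_encoders.parameters(), "lr": peak_lr},
    ]
    optimizer = torch.optim.AdamW(param_groups, weight_decay=weight_decay)

    def lr_scale(epoch: int) -> float:
        # Linear warmup, then cosine decay from peak_lr toward min_lr.
        # The same schedule shape is applied to every parameter group
        # (an assumption; the paper does not detail the encoder schedules).
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
    return optimizer, scheduler

# Masking ratios used during pre-training (60% of image patches, 30% of text tokens).
IMAGE_MASK_RATIO = 0.60
TEXT_MASK_RATIO = 0.30
```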
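The Dataset Splits row states that the best checkpoint is chosen by an average recall over image and text retrieval on the Karpathy validation split. The sketch below assumes the common convention of averaging R@1, R@5, and R@10 over both retrieval directions; the exact recall levels are not spelled out in the quoted sentence.

```python
# Hedged sketch of an "average recall" model-selection metric,
# assumed to be the mean of R@1/5/10 for image-to-text and text-to-image retrieval.
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int) -> float:
    """Fraction of queries whose correct match is ranked within the top k (ranks are 0-based)."""
    return float(np.mean(ranks < k))

def average_recall(i2t_ranks: np.ndarray, t2i_ranks: np.ndarray) -> float:
    """Mean of R@1, R@5, R@10 over both retrieval directions."""
    recalls = [
        recall_at_k(ranks, k)
        for ranks in (i2t_ranks, t2i_ranks)
        for k in (1, 5, 10)
    ]
    return float(np.mean(recalls))
```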