Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Authors: Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, Steven Chu Hong Hoi

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of ALBEF on various downstream V+L tasks including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly-supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods.
Researcher Affiliation | Industry | Salesforce Research {junnan.li,rselvaraju,akhilesh.gotmare,sjoty,shoi}@salesforce.com
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/salesforce/ALBEF.
Open Datasets | Yes | Following UNITER [2], we construct our pre-training data using two web datasets (Conceptual Captions [4], SBU Captions [5]) and two in-domain datasets (COCO [41] and Visual Genome [42]).
Dataset Splits | Yes | We evaluate ALBEF on the Flickr30K [49] and COCO benchmarks, and fine-tune the pre-trained model using the training samples from each dataset. Table 4: Comparison with state-of-the-art methods on downstream vision-language tasks (contains columns for 'test-dev', 'test-std', 'dev', 'test-P', 'val'). Table 7 studies the effect of text-assignment (TA) pre-training and parameter sharing on NLVR2 (contains a column for 'dev').
Hardware Specification | Yes | We pre-train the model for 30 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software components like BERTbase, AdamW, and RandAugment, but does not provide specific version numbers for these or for core dependencies such as Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | We pre-train the model for 30 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs. We use the AdamW [44] optimizer with a weight decay of 0.02. The learning rate is warmed-up to 1e-4 in the first 1000 iterations, and decayed to 1e-5 following a cosine schedule. During pre-training, we take random image crops of resolution 256×256 as input, and also apply RandAugment [45]. During fine-tuning, we increase the image resolution to 384×384 and interpolate the positional encoding of image patches following [38]. The momentum parameter for updating the momentum model is set as 0.995, and the size of the queue used for image-text contrastive learning is set as 65,536. We linearly ramp-up the distillation weight from 0 to 0.4 within the 1st epoch.
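
The quoted setup fully specifies the optimization schedule, so it can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the encoder, the loss, and the steps-per-epoch count are hypothetical placeholders, and only the stated hyperparameters (AdamW with weight decay 0.02, warm-up to 1e-4 over 1000 iterations, cosine decay to 1e-5, EMA momentum 0.995, distillation weight ramped from 0 to 0.4 during the first epoch) come from the paper.

```python
# Sketch of the reported pre-training schedule. Placeholders: the encoder,
# the loss, and STEPS_PER_EPOCH. Hyperparameters come from the setup above.
import copy
import math
import torch

EPOCHS = 30
STEPS_PER_EPOCH = 1000          # illustrative; not stated in the paper
WARMUP_STEPS = 1000
LR_PEAK, LR_MIN = 1e-4, 1e-5
EMA_MOMENTUM, ALPHA_MAX = 0.995, 0.4

model = torch.nn.Linear(16, 16)           # stand-in for the ALBEF encoders
momentum_model = copy.deepcopy(model)     # EMA copy used for momentum distillation
for p in momentum_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR_PEAK, weight_decay=0.02)

def lr_at(step: int) -> float:
    """Linear warm-up to LR_PEAK, then cosine decay toward LR_MIN."""
    if step < WARMUP_STEPS:
        return LR_PEAK * (step + 1) / WARMUP_STEPS
    total = EPOCHS * STEPS_PER_EPOCH - WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(total, 1)
    return LR_MIN + 0.5 * (LR_PEAK - LR_MIN) * (1.0 + math.cos(math.pi * progress))

def distill_weight(step: int) -> float:
    """Distillation weight ramps linearly from 0 to ALPHA_MAX within the 1st epoch."""
    return ALPHA_MAX * min(1.0, step / STEPS_PER_EPOCH)

@torch.no_grad()
def ema_update(online, ema, m=EMA_MOMENTUM):
    """Momentum-model update: p_ema <- m * p_ema + (1 - m) * p_online."""
    for p, p_ema in zip(online.parameters(), ema.parameters()):
        p_ema.mul_(m).add_(p.detach(), alpha=1.0 - m)

for step in range(EPOCHS * STEPS_PER_EPOCH):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    alpha = distill_weight(step)

    x = torch.randn(4, 16)                  # dummy batch
    target = momentum_model(x)              # pseudo-target from the EMA model
    # Placeholder loss: a ground-truth-style term mixed with the distillation term.
    loss = (1.0 - alpha) * model(x).pow(2).mean() + \
           alpha * (model(x) - target).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(model, momentum_model)
```

In ALBEF itself the distillation term is a KL divergence against the momentum model's soft pseudo-targets for the contrastive and masked-language-modeling objectives; the squared-error term above is only a placeholder that shows where the ramped weight enters the loss.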