Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Authors: Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, Steven Chu Hong Hoi
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of ALBEF on various downstream V+L tasks including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly-supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods. |
| Researcher Affiliation | Industry | Salesforce Research {junnan.li,rselvaraju,akhilesh.gotmare,sjoty,shoi}@salesforce.com |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are available at https://github.com/salesforce/ALBEF. |
| Open Datasets | Yes | Following UNITER [2], we construct our pre-training data using two web datasets (Conceptual Captions [4], SBU Captions [5]) and two in-domain datasets (COCO [41] and Visual Genome [42]). |
| Dataset Splits | Yes | We evaluate ALBEF on the Flickr30K [49] and COCO benchmarks, and fine-tune the pretrained model using the training samples from each dataset. Table 4: Comparison with state-of-the-art methods on downstream vision-language tasks. (Contains columns for 'test-dev', 'test-std', 'dev', 'test-P', 'val') Table 7 studies the effect of text-assignment (TA) pre-training and parameter sharing on NLVR2. (Contains columns for 'dev') |
| Hardware Specification | Yes | We pre-train the model for 30 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions software components like BERT-base, AdamW, and RandAugment, but does not provide specific version numbers for these or for core dependencies like Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | We pre-train the model for 30 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs. We use the AdamW [44] optimizer with a weight decay of 0.02. The learning rate is warmed-up to 1e-4 in the first 1000 iterations, and decayed to 1e-5 following a cosine schedule. During pre-training, we take random image crops of resolution 256×256 as input, and also apply RandAugment [45]. During fine-tuning, we increase the image resolution to 384×384 and interpolate the positional encoding of image patches following [38]. The momentum parameter for updating the momentum model is set as 0.995, and the size of the queue used for image-text contrastive learning is set as 65,536. We linearly ramp-up the distillation weight from 0 to 0.4 within the 1st epoch. |
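
For readers who want to reproduce the reported schedule, below is a minimal PyTorch sketch of the hyper-parameters quoted in the Experiment Setup row: AdamW with weight decay 0.02, a 1000-iteration warm-up to 1e-4 followed by cosine decay to 1e-5, a 0.995 EMA momentum model, and a distillation weight ramped from 0 to 0.4 over the first epoch. This is not the authors' implementation; the function names (`lr_at`, `alpha_at`, `ema_update`) and the stand-in `nn.Linear` modules are illustrative assumptions, and the 65,536-entry contrastive queue itself is omitted.

```python
# Hypothetical sketch of the ALBEF pre-training schedule described in the paper.
import math
import torch
from torch import nn

EPOCHS = 30
WARMUP_ITERS = 1000
LR_PEAK, LR_MIN = 1e-4, 1e-5
WEIGHT_DECAY = 0.02
EMA_MOMENTUM = 0.995        # momentum parameter for the momentum model
QUEUE_SIZE = 65536          # image-text contrastive queue length (not built here)
ALPHA_MAX = 0.4             # final momentum-distillation weight


def lr_at(step: int, total_steps: int) -> float:
    """Linear warm-up to LR_PEAK, then cosine decay down to LR_MIN."""
    if step < WARMUP_ITERS:
        return LR_PEAK * (step + 1) / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / max(1, total_steps - WARMUP_ITERS)
    return LR_MIN + 0.5 * (LR_PEAK - LR_MIN) * (1.0 + math.cos(math.pi * progress))


def alpha_at(epoch: int, step_in_epoch: int, steps_per_epoch: int) -> float:
    """Distillation weight: ramped linearly from 0 to ALPHA_MAX within epoch 0."""
    if epoch == 0:
        return ALPHA_MAX * step_in_epoch / steps_per_epoch
    return ALPHA_MAX


@torch.no_grad()
def ema_update(model: nn.Module, momentum_model: nn.Module, m: float = EMA_MOMENTUM):
    """Momentum-model update: theta_m <- m * theta_m + (1 - m) * theta."""
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)


# Toy usage with stand-in modules (the real model is a ViT image encoder plus a
# BERT-based text/fusion encoder).
model = nn.Linear(8, 8)
momentum_model = nn.Linear(8, 8)
momentum_model.load_state_dict(model.state_dict())
optimizer = torch.optim.AdamW(model.parameters(), lr=LR_PEAK, weight_decay=WEIGHT_DECAY)

steps_per_epoch = 100                       # placeholder; depends on the dataset size
total_steps = EPOCHS * steps_per_epoch
for step in range(total_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step, total_steps)
    alpha = alpha_at(step // steps_per_epoch, step % steps_per_epoch, steps_per_epoch)
    # ... compute the pre-training losses (weighting the momentum-model targets
    # by `alpha`), call optimizer.step(), then refresh the momentum model:
    ema_update(model, momentum_model)
```

The sketch only covers the optimization schedule; image preprocessing (256×256 random crops with RandAugment during pre-training, 384×384 with interpolated positional encodings during fine-tuning) and the contrastive queue would be handled in the data pipeline and loss code respectively.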