Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

Authors: Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, Steven Chu Hong Hoi

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of ALBEF on various downstream V+L tasks including image-text retrieval, visual question answering, visual reasoning, visual entailment, and weakly-supervised visual grounding. ALBEF achieves substantial improvements over existing state-of-the-art methods.
Researcher Affiliation | Industry | Salesforce Research {junnan.li,rselvaraju,akhilesh.gotmare,sjoty,shoi}@salesforce.com
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models are available at https://github.com/salesforce/ALBEF.
Open Datasets | Yes | Following UNITER [2], we construct our pre-training data using two web datasets (Conceptual Captions [4], SBU Captions [5]) and two in-domain datasets (COCO [41] and Visual Genome [42]).
Dataset Splits | Yes | We evaluate ALBEF on the Flickr30K [49] and COCO benchmarks, and fine-tune the pre-trained model using the training samples from each dataset. Table 4: Comparison with state-of-the-art methods on downstream vision-language tasks (contains columns for 'test-dev', 'test-std', 'dev', 'test-P', 'val'). Table 7 studies the effect of text-assignment (TA) pre-training and parameter sharing on NLVR2 (contains a column for 'dev').
Hardware Specification | Yes | We pre-train the model for 30 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software components like BERTbase, AdamW, and RandAugment, but does not provide specific version numbers for these or for core dependencies such as Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | We pre-train the model for 30 epochs using a batch size of 512 on 8 NVIDIA A100 GPUs. We use the AdamW [44] optimizer with a weight decay of 0.02. The learning rate is warmed-up to 1e-4 in the first 1000 iterations, and decayed to 1e-5 following a cosine schedule. During pre-training, we take random image crops of resolution 256×256 as input, and also apply RandAugment [45]. During fine-tuning, we increase the image resolution to 384×384 and interpolate the positional encoding of image patches following [38]. The momentum parameter for updating the momentum model is set as 0.995, and the size of the queue used for image-text contrastive learning is set as 65,536. We linearly ramp-up the distillation weight from 0 to 0.4 within the 1st epoch.
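
The quoted setup fully specifies the optimization schedule, so it can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the encoder, the loss, and the steps-per-epoch count are hypothetical placeholders, and only the stated hyperparameters (AdamW with weight decay 0.02, warm-up to 1e-4 over 1000 iterations, cosine decay to 1e-5, EMA momentum 0.995, distillation weight ramped from 0 to 0.4 during the first epoch) come from the paper.

```python
# Sketch of the reported pre-training schedule. Placeholders: the encoder,
# the loss, and STEPS_PER_EPOCH. Hyperparameters come from the setup above.
import copy
import math
import torch

EPOCHS = 30
STEPS_PER_EPOCH = 1000          # illustrative; not stated in the paper
WARMUP_STEPS = 1000
LR_PEAK, LR_MIN = 1e-4, 1e-5
EMA_MOMENTUM, ALPHA_MAX = 0.995, 0.4

model = torch.nn.Linear(16, 16)           # stand-in for the ALBEF encoders
momentum_model = copy.deepcopy(model)     # EMA copy used for momentum distillation
for p in momentum_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR_PEAK, weight_decay=0.02)

def lr_at(step: int) -> float:
    """Linear warm-up to LR_PEAK, then cosine decay toward LR_MIN."""
    if step < WARMUP_STEPS:
        return LR_PEAK * (step + 1) / WARMUP_STEPS
    total = EPOCHS * STEPS_PER_EPOCH - WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(total, 1)
    return LR_MIN + 0.5 * (LR_PEAK - LR_MIN) * (1.0 + math.cos(math.pi * progress))

def distill_weight(step: int) -> float:
    """Distillation weight ramps linearly from 0 to ALPHA_MAX within the 1st epoch."""
    return ALPHA_MAX * min(1.0, step / STEPS_PER_EPOCH)

@torch.no_grad()
def ema_update(online, ema, m=EMA_MOMENTUM):
    """Momentum-model update: p_ema <- m * p_ema + (1 - m) * p_online."""
    for p, p_ema in zip(online.parameters(), ema.parameters()):
        p_ema.mul_(m).add_(p.detach(), alpha=1.0 - m)

for step in range(EPOCHS * STEPS_PER_EPOCH):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    alpha = distill_weight(step)

    x = torch.randn(4, 16)                  # dummy batch
    target = momentum_model(x)              # pseudo-target from the EMA model
    # Placeholder loss: a ground-truth-style term mixed with the distillation term.
    loss = (1.0 - alpha) * model(x).pow(2).mean() + \
           alpha * (model(x) - target).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(model, momentum_model)
```

In ALBEF itself the distillation term is a KL divergence against the momentum model's soft pseudo-targets for the contrastive and masked-language-modeling objectives; the squared-error term above is only a placeholder that shows where the ramped weight enters the loss.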