Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Authors: Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data.
Researcher Affiliation | Collaboration | Microsoft; University of California, Los Angeles; New York University. {zdou,violetpeng}@cs.ucla.edu, {aish,yann.lecun}@nyu.edu, pengchuanzhang@fb.com, {zhgan,jianfw,linjli,zliu,liuce,jfgao,lijuanw}@microsoft.com
Pseudocode | No | The paper describes the model architecture and processes using equations and descriptive text, but it does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Code is available at https://github.com/microsoft/FIBER.
Open Datasets | Yes | Pre-training Datasets. Following previous work [6, 28, 32, 13, 64, 66], we perform coarse-grained pre-training on COCO [41], Conceptual Captions [56], SBU Captions [47], and Visual Genome [29].
Dataset Splits | Yes | Pre-training Datasets. Following previous work [6, 28, 32, 13, 64, 66], we perform coarse-grained pre-training on COCO [41], Conceptual Captions [56], SBU Captions [47], and Visual Genome [29]. The four datasets consist of about 4M images in total. For fine-grained pre-training, we use two data sources: data curated by MDETR [26] after removing the COCO images, and the Objects365 [55] detection dataset, together consisting of about 0.8M images. We ensure that we exclude any data that exists in the validation or test splits of downstream tasks. (A sketch of this exclusion step follows the table.)
Hardware Specification | Yes | We perform coarse-grained pre-training for 100k steps with a batch size of 4,096 on 64 A100 GPUs. For fine-grained pre-training, we train for 800k steps on 64 V100 GPUs, with a batch size of 64.
Software Dependencies | No | The paper mentions specific optimizers (AdamW) and model architectures (RoBERTa, Swin Transformer) but does not provide version numbers for software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | Implementation Details. We perform coarse-grained pre-training for 100k steps with a batch size of 4,096 on 64 A100 GPUs. We use AdamW [44] with peak learning rates of 1e-4 for the backbones and 5e-4 for the cross-modal parameters. We use linear warmup over the first 1k steps and linear decay. For fine-grained pre-training, we train for 800k steps on 64 V100 GPUs, with a batch size of 64. We use a learning rate of 1e-5 for the language backbone, and 1e-4 for the rest of the model, with a weight decay of 0.01. We use linear warmup over the first 2k steps and then a constant learning rate, with two learning rate drops by a factor of 10 at 67% and 89% of the total number of steps. (A sketch of this schedule follows the table.)
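
The exclusion quoted under Dataset Splits (dropping any pre-training image that also appears in a downstream validation or test split) amounts to a set difference over image identifiers. Below is a minimal sketch of that filtering step; the `image_id` field, the list-of-dicts format, and the toy values are illustrative assumptions, not FIBER's actual data pipeline.

```python
# Minimal sketch of excluding downstream val/test images from pre-training data.
# The `image_id` key and in-memory sample format are assumptions for illustration.
def filter_pretraining_samples(pretrain_samples, downstream_eval_image_ids):
    """Drop any pre-training sample whose image occurs in a downstream
    validation or test split."""
    blocked = set(downstream_eval_image_ids)
    return [s for s in pretrain_samples if s["image_id"] not in blocked]

# Toy usage: one of the two samples overlaps with a downstream test split.
samples = [
    {"image_id": "coco_123", "caption": "a dog on a couch"},
    {"image_id": "coco_999", "caption": "a cat on a table"},
]
eval_ids = {"coco_999"}
clean = filter_pretraining_samples(samples, eval_ids)
assert [s["image_id"] for s in clean] == ["coco_123"]
```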
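
The optimizer and learning-rate schedules quoted under Experiment Setup can also be written out as a short PyTorch sketch. This is a minimal illustration under stated assumptions, not FIBER's released training code: the toy model and the parameter-group names ("backbone", "cross_modal") are hypothetical, while the numeric values (peak LRs, warmup lengths, total steps, and the 67%/89% drop points) come from the quoted setup.

```python
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Toy stand-in for the model; the module names "backbone" and "cross_modal"
# are illustrative assumptions, not FIBER's actual identifiers.
model = nn.ModuleDict({
    "backbone": nn.Linear(16, 16),
    "cross_modal": nn.Linear(16, 16),
})

backbone_params = [p for n, p in model.named_parameters() if "cross_modal" not in n]
cross_modal_params = [p for n, p in model.named_parameters() if "cross_modal" in n]

# Coarse-grained stage: peak LRs of 1e-4 (backbones) and 5e-4 (cross-modal).
# The fine-grained stage instead uses 1e-5 for the language backbone and 1e-4
# for the rest of the model with weight decay 0.01 (not reproduced here).
optimizer = AdamW([
    {"params": backbone_params, "lr": 1e-4},
    {"params": cross_modal_params, "lr": 5e-4},
])

# Coarse-grained schedule: linear warmup over the first 1k steps, then linear
# decay to zero over the 100k total steps.
def coarse_lr_lambda(step, warmup=1_000, total=100_000):
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

# Fine-grained schedule: linear warmup over the first 2k steps, then a constant
# LR with two drops by a factor of 10 at 67% and 89% of the 800k total steps.
def fine_lr_lambda(step, warmup=2_000, total=800_000):
    if step < warmup:
        return step / max(1, warmup)
    scale = 1.0
    if step >= 0.67 * total:
        scale *= 0.1
    if step >= 0.89 * total:
        scale *= 0.1
    return scale

# Swap in fine_lr_lambda for the fine-grained pre-training stage.
scheduler = LambdaLR(optimizer, lr_lambda=coarse_lr_lambda)
```

With `LambdaLR`, the per-group learning rates set on the optimizer act as the peak values that the schedule multiplies, which matches the "peak learning rates" phrasing in the quoted setup.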