Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Authors: Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For evaluation, we mostly focus on UNITER [10], the current best-performing V+L model with state-of-the-art performance across many popular V+L benchmarks, and enhance UNITER with VILLA through comprehensive experiments on six V+L tasks: VQA [15], VCR [72], NLVR2 [54], Visual Entailment [66], Referring Expression Comprehension [69], and Image-Text Retrieval [27].
Researcher Affiliation | Collaboration | Zhe Gan^1, Yen-Chun Chen^1, Linjie Li^1, Chen Zhu^2, Yu Cheng^1, Jingjing Liu^1; ^1 Microsoft Dynamics 365 AI Research, ^2 University of Maryland, College Park; {zhe.gan,yen-chun.chen,lindsey.li,yu.cheng,jingjl}@microsoft.com, chenzhu@cs.umd.edu
Pseudocode | Yes | Algorithm 1: Free Multi-modal Adversarial Training used in VILLA. (A hedged sketch of this training loop is given after the table.)
Open Source Code | Yes | Code is available at https://github.com/zhegan27/VILLA.
Open Datasets | Yes | For UNITER experiments, we pre-train with the same four large-scale datasets used in the original model: COCO [33], Visual Genome (VG) [26], Conceptual Captions [52] and SBU Captions [45].
Dataset Splits | Yes | For VQA, an internal val set is used. (Figure 2 caption, page 5); Table 1(a) header (page 4) lists the evaluation splits per task: VQA: test-dev, test-std; VCR: Q→A, QA→R, Q→AR; NLVR2: dev, test-P; SNLI-VE: val, test.
Hardware Specification | No | The paper discusses computational cost and large-scale training but does not specify any particular hardware (e.g., GPU models, CPU types) used for experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | UNITER-base is a single-stream model with 12 layers, 768 hidden units per layer, and 12 attention heads; UNITER-large has 24 layers, 1024 hidden units per layer, and 16 attention heads. The original UNITER-base (12 layers) and UNITER-large (24 layers) models take 200k and 500k steps of pre-training, respectively. For a fair comparison, when applying VILLA to UNITER-base, we run 100k steps of standard training followed by 100k steps of adversarial training; when applying VILLA to UNITER-large, to save pre-training time, we run 425k steps of standard training followed by 75k steps of adversarial training. We typically select the adversarial learning rate from {1e-2, 1e-3}, set the number of adversarial training steps to 3, and select α (Eqn. 2) from {1.0, 1.5, 2.0}. (A hedged configuration sketch follows the table.)
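
The paper's Algorithm 1 (Free Multi-modal Adversarial Training) perturbs both the textual and visual embedding spaces and reuses each backward pass to update model parameters and perturbations together. The sketch below illustrates that loop in PyTorch under assumptions not taken from the paper or its code: `model(txt_emb, img_emb)` returns task logits, the embedding tensors are fixed (detached) inputs so only the transformer parameters and the perturbations receive gradients, and cross-entropy is the task loss. It is a minimal illustration of the algorithm's structure, not the reference implementation.

```python
# Minimal sketch of FreeLB-style multi-modal adversarial training (assumptions
# noted above); hyperparameter names mirror the reported search ranges.
import torch
import torch.nn.functional as F


def symmetric_kl(p_logits, q_logits):
    """Symmetric KL divergence between two logit tensors (the fine-grained
    adversarial regularization term)."""
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    return (F.kl_div(log_p, log_q.exp(), reduction="batchmean")
            + F.kl_div(log_q, log_p.exp(), reduction="batchmean"))


def villa_style_step(model, optimizer, txt_emb, img_emb, labels,
                     adv_steps=3, adv_lr=1e-3, adv_max_norm=0.0, alpha=1.0):
    """One parameter update combining the clean loss, the adversarial loss in
    both embedding spaces, and the KL consistency term (a sketch)."""
    model.train()
    optimizer.zero_grad()

    # Clean pass: standard task loss; detached logits anchor the KL term.
    clean_logits = model(txt_emb, img_emb)
    F.cross_entropy(clean_logits, labels).backward()
    clean_logits = clean_logits.detach()

    # Small random perturbations in the text and image embedding spaces.
    delta_txt = torch.zeros_like(txt_emb).uniform_(-1e-3, 1e-3).requires_grad_()
    delta_img = torch.zeros_like(img_emb).uniform_(-1e-3, 1e-3).requires_grad_()

    for _ in range(adv_steps):
        adv_logits = model(txt_emb + delta_txt, img_emb + delta_img)
        adv_loss = (F.cross_entropy(adv_logits, labels)
                    + alpha * symmetric_kl(adv_logits, clean_logits))
        # "Free" training: parameter gradients accumulate across the ascent
        # steps, so the same backward pass also drives the perturbation update.
        (adv_loss / adv_steps).backward()

        for delta in (delta_txt, delta_img):
            grad = delta.grad.detach()
            # Gradient ascent on the perturbation, then an optional L2 projection.
            delta.data.add_(adv_lr * grad / (grad.norm() + 1e-8))
            if adv_max_norm > 0:
                delta.data = delta.data.renorm(p=2, dim=0, maxnorm=adv_max_norm)
            delta.grad.zero_()

    optimizer.step()
```

With adv_steps set to 1 this reduces to single-step adversarial training; the reported fine-tuning runs select the adversarial learning rate from {1e-2, 1e-3} and α from {1.0, 1.5, 2.0}.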
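
For quick reference, the architecture and schedule reported in the Experiment Setup row can be written as a plain configuration. The key names below are illustrative, not identifiers from the VILLA codebase.

```python
# Illustrative configuration summarizing the reported setup (key names are ours).
VILLA_SETUP = {
    "uniter_base": {
        "layers": 12, "hidden_size": 768, "attention_heads": 12,
        # Pre-training schedule: 100k standard + 100k adversarial steps.
        "standard_pretrain_steps": 100_000, "adversarial_pretrain_steps": 100_000,
    },
    "uniter_large": {
        "layers": 24, "hidden_size": 1024, "attention_heads": 16,
        # Pre-training schedule: 425k standard + 75k adversarial steps.
        "standard_pretrain_steps": 425_000, "adversarial_pretrain_steps": 75_000,
    },
    # Reported settings and search ranges for adversarial training.
    "adv_learning_rate_choices": [1e-2, 1e-3],
    "adv_training_steps": 3,
    "alpha_choices": [1.0, 1.5, 2.0],
}
```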