Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Authors: Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we mostly focus on UNITER [10], the current best-performing V+L model with state-of-the-art performance across many popular V+L benchmarks, and enhance UNITER with VILLA through comprehensive experiments on six V+L tasks: VQA [15], VCR [72], NLVR2 [54], Visual Entailment [66], Referring Expression Comprehension [69], and Image-Text Retrieval [27]. |
| Researcher Affiliation | Collaboration | Zhe Gan1, Yen-Chun Chen1, Linjie Li1, Chen Zhu2, Yu Cheng1, Jingjing Liu1 1Microsoft Dynamics 365 AI Research, 2University of Maryland, College Park {zhe.gan,yen-chun.chen,lindsey.li,yu.cheng,jingjl}@microsoft.com chenzhu@cs.umd.edu |
| Pseudocode | Yes | Algorithm 1 Free Multi-modal Adversarial Training used in VILLA. |
| Open Source Code | Yes | Code is available at https://github.com/zhegan27/VILLA. |
| Open Datasets | Yes | For UNITER experiments, we pre-train with the same four large-scale datasets used in the original model: COCO [33], Visual Genome (VG) [26], Conceptual Captions [52] and SBU Captions [45]. |
| Dataset Splits | Yes | For VQA, an internal val set is used. (Figure 2 caption, page 5); Table 1(a) header (page 4) lists the evaluation splits per task: VQA (test-dev, test-std), VCR (Q→A, QA→R, Q→AR), NLVR2 (dev, test-P), SNLI-VE (val, test). |
| Hardware Specification | No | The paper discusses computational cost and large-scale training but does not specify any particular hardware (e.g., GPU models, CPU types) used for experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | UNITER-base is a single-stream model, which has 12 layers, with 768 hidden units per layer and 12 attention heads; UNITER-large has 24 layers, with 1024 hidden units per layer and 16 attention heads. The original UNITER-base (12 layers) and UNITER-large (24 layers) models take 200k and 500k steps for pre-training, respectively. For fair comparison, when applying VILLA to UNITER-base, we run 100k steps of standard training, followed by 100k steps of adversarial training. When applying VILLA to UNITER-large, to save pre-training time, we run 425k steps of standard training, followed by 75k steps of adversarial training. We typically select the adversarial learning rate from {1e-2, 1e-3}, set the number of adversarial training steps to 3, and select α (Eqn. 2) from {1.0, 1.5, 2.0}. |
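
For context on the Pseudocode row: Algorithm 1 in the paper is a "free" multi-modal adversarial training procedure. The PyTorch-style sketch below illustrates what one such step can look like under our own simplifying assumptions: a `model` callable that maps perturbed text and image embeddings to classification logits, and joint updates of both perturbations (the paper's Algorithm 1 handles the image and text perturbations separately). This is a hedged illustration, not the authors' implementation, which is available at the GitHub link above.

```python
import torch
import torch.nn.functional as F

def villa_style_adv_step(model, txt_emb, img_emb, labels,
                         adv_lr=1e-3, adv_steps=3, alpha=1.0, eps=1e-6):
    """One 'free' adversarial training step on text and image embeddings.

    The caller zeroes parameter gradients before this call and runs
    optimizer.step() afterwards; `model(txt_emb, img_emb)` returning logits
    is an assumed, simplified interface.
    """
    # Clean forward pass: fixed reference distribution for the consistency term.
    with torch.no_grad():
        clean_logits = model(txt_emb, img_emb)

    # Perturbations for both modalities, initialized to zero.
    delta_txt = torch.zeros_like(txt_emb, requires_grad=True)
    delta_img = torch.zeros_like(img_emb, requires_grad=True)

    total_loss = 0.0
    for _ in range(adv_steps):
        logits = model(txt_emb + delta_txt, img_emb + delta_img)
        ce = F.cross_entropy(logits, labels)

        # Symmetric KL between clean and adversarial predictions, weighted by
        # alpha (our reading of the Eqn. 2 consistency term).
        kl = F.kl_div(F.log_softmax(logits, dim=-1),
                      F.softmax(clean_logits, dim=-1),
                      reduction="batchmean") \
           + F.kl_div(F.log_softmax(clean_logits, dim=-1),
                      F.softmax(logits, dim=-1),
                      reduction="batchmean")
        loss = (ce + alpha * kl) / adv_steps

        # "Free" training: the same backward pass accumulates gradients for
        # the model parameters and for the perturbations.
        loss.backward()
        total_loss += loss.item()

        # Gradient ascent on the perturbations (normalized step).
        for delta in (delta_txt, delta_img):
            grad = delta.grad.detach()
            delta.data = delta.data + adv_lr * grad / (grad.norm() + eps)
            delta.grad.zero_()

    return total_loss
```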
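
The hyperparameter ranges quoted in the Experiment Setup row can also be read as a small search grid. A minimal sketch, with variable names of our own choosing rather than the authors' configuration schema:

```python
from itertools import product

ADV_LR = [1e-2, 1e-3]        # adversarial learning rate candidates
ADV_STEPS = [3]              # number of adversarial training steps (fixed at 3)
ALPHA = [1.0, 1.5, 2.0]      # weight on the consistency term (Eqn. 2)

search_space = [
    {"adv_lr": lr, "adv_steps": k, "alpha": a}
    for lr, k, a in product(ADV_LR, ADV_STEPS, ALPHA)
]
print(len(search_space))  # 6 candidate configurations
```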