Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Large-Scale Adversarial Training for Vision-and-Language Representation Learning
Authors: Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For evaluation, we mostly focus on UNITER [10], the current best-performing V+L model with state-of-the-art performance across many popular V+L benchmarks, and enhance UNITER with VILLA through comprehensive experiments on six V+L tasks: VQA [15], VCR [72], NLVR2 [54], Visual Entailment [66], Referring Expression Comprehension [69], and Image-Text Retrieval [27]. |
| Researcher Affiliation | Collaboration | Zhe Gan¹, Yen-Chun Chen¹, Linjie Li¹, Chen Zhu², Yu Cheng¹, Jingjing Liu¹; ¹Microsoft Dynamics 365 AI Research, ²University of Maryland, College Park |
| Pseudocode | Yes | Algorithm 1 Free Multi-modal Adversarial Training used in VILLA. |
| Open Source Code | Yes | Code is available at https://github.com/zhegan27/VILLA. |
| Open Datasets | Yes | For UNITER experiments, we pre-train with the same four large-scale datasets used in the original model: COCO [33], Visual Genome (VG) [26], Conceptual Captions [52] and SBU Captions [45]. |
| Dataset Splits | Yes | For VQA, an internal val set is used. (Figure 2 caption, page 5); Table 1(a) header (page 4) lists the evaluation splits per task: VQA (test-dev, test-std); VCR (Q→A, QA→R, Q→AR); NLVR2 (dev, test-P); SNLI-VE (val, test). |
| Hardware Specification | No | The paper discusses computational cost and large-scale training but does not specify any particular hardware (e.g., GPU models, CPU types) used for experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | UNITER-base is a single-stream model, which has 12 layers, with 768 hidden units per layer and 12 attention heads; UNITER-large has 24 layers, with 1024 hidden units per layer and 16 attention heads. The original UNITER-base (12 layers) and UNITER-large (24 layers) models take 200k and 500k steps for pre-training, respectively. For fair comparison, when applying VILLA to UNITER-base, we run 100k steps of standard training, followed by 100k steps of adversarial training. When applying VILLA to UNITER-large, to save pre-training time, we run 425k steps of standard training, followed by 75k steps of adversarial training. We typically select adversarial learning rate from {1e-2, 1e-3}, adversarial training steps to 3, and α (Eqn. 2) from 1.0, 1.5, 2.0. |
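
The Experiment Setup row can be restated as a small configuration sketch, which makes it easy to check that the standard and adversarial phases add up to the original 200k and 500k pre-training budgets. This is only an illustration: the class, field, and function names below are hypothetical and do not come from the VILLA repository, and enumerating the hyperparameter choices as a full grid is an assumption, since the paper says only that values are "typically" selected from these sets.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ModelConfig:
    """Architecture and pre-training schedule quoted in the table (names are hypothetical)."""
    name: str
    layers: int
    hidden_size: int
    attention_heads: int
    standard_steps: int      # steps of standard pre-training
    adversarial_steps: int   # steps of adversarial pre-training that follow

    @property
    def total_steps(self) -> int:
        return self.standard_steps + self.adversarial_steps


# Values taken from the Experiment Setup row.
UNITER_BASE = ModelConfig("UNITER-base", 12, 768, 12, 100_000, 100_000)
UNITER_LARGE = ModelConfig("UNITER-large", 24, 1024, 16, 425_000, 75_000)

# Adversarial hyperparameters quoted in the same row.
ADV_LEARNING_RATES = (1e-2, 1e-3)
ADV_STEPS = 3
ALPHAS = (1.0, 1.5, 2.0)


def hyperparameter_grid():
    """Enumerate every (learning rate, alpha) pair; an exhaustive grid is an assumption."""
    return [
        {"adv_lr": lr, "adv_steps": ADV_STEPS, "alpha": alpha}
        for lr, alpha in product(ADV_LEARNING_RATES, ALPHAS)
    ]


if __name__ == "__main__":
    for cfg in (UNITER_BASE, UNITER_LARGE):
        print(cfg.name, cfg.total_steps)
    print(len(hyperparameter_grid()), "candidate settings")
```

Under these assumptions, both schedules preserve the step budget of the corresponding original UNITER model, and the quoted choices give six candidate settings per model.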