ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Authors: Wonjae Kim, Bokyung Son, Ildoo Kim
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, we use VQAv2 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2018), and for retrieval, we use MSCOCO and Flickr30K (F30K) (Plummer et al., 2015) re-splited by Karpathy & Fei-Fei (2015). ... In Table 5, we perform various ablations. |
| Researcher Affiliation | Industry | Current affiliation: NAVER AI Lab, Seongnam, Gyeonggi, Republic of Korea. Kakao Enterprise, Seongnam, Gyeonggi, Republic of Korea; Kakao Brain, Seongnam, Gyeonggi, Republic of Korea. Correspondence to: Wonjae Kim <wonjae.kim@navercorp.com>. |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and pre-trained weights are available at https://github.com/dandelin/vilt. |
| Open Datasets | Yes | We use four datasets for pre-training: Microsoft COCO (MSCOCO) (Lin et al., 2014), Visual Genome (VG) (Krishna et al., 2017), SBU Captions (SBU) (Ordonez et al., 2011), and Google Conceptual Captions (GCC) (Sharma et al., 2018). ... We evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, we use VQAv2 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2018), and for retrieval, we use MSCOCO and Flickr30K (F30K) (Plummer et al., 2015) re-splited by Karpathy & Fei-Fei (2015). |
| Dataset Splits | Yes | For the classification tasks, we fine-tune three times with different initialization seeds for the head and data ordering and report the mean scores. ... For the retrieval tasks, we only fine-tune once. ... We fine-tune ViLT-B/32 on the Karpathy & Fei-Fei (2015) split of MSCOCO and F30K. ... reserving 1,000 validation images and their related questions for internal validation. |
| Hardware Specification | Yes | We pre-train ViLT-B/32 for 100K or 200K steps on 64 NVIDIA V100 GPUs... The latency is averaged over 10K times on a Xeon E5-2650 CPU and an NVIDIA P40 GPU. |
| Software Dependencies | No | The paper mentions using specific models like pre-trained BERT and ViT, and an AdamW optimizer, but it does not specify version numbers for general software dependencies or libraries used in the implementation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For all experiments, we use AdamW optimizer (Loshchilov & Hutter, 2018) with base learning rate of 10⁻⁴ and weight decay of 10⁻². ... We pre-train ViLT-B/32 for 100K or 200K steps... with a batch size of 4,096. For all downstream tasks, we train for ten epochs with a batch size of 256 for VQAv2/retrieval tasks and 128 for NLVR2. ... Hidden size H is 768, layer depth D is 12, patch size P is 32, MLP size is 3,072, and the number of attention heads is 12. ... We use a two-layer MLP of hidden size 1,536 as the fine-tuned downstream head. ... We apply RandAugment (Cubuk et al., 2020) during fine-tuning. ... We use N = 2, M = 9 as the hyperparameters. ... We mask whole words with a mask probability of 0.15 during pre-training. (Hedged sketches of the optimizer/head setup and the whole-word masking step follow this table.) |
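The fine-tuning setup reported above maps directly onto standard PyTorch. Below is a minimal sketch, assuming a pooled 768-dimensional backbone output and a VQAv2-style answer vocabulary of 3,129 classes; `downstream_head`, `NUM_ANSWERS`, and the head's LayerNorm/GELU layout are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two-layer MLP downstream head and AdamW settings
# described in the Experiment Setup row (lr 1e-4, weight decay 1e-2).
import torch
import torch.nn as nn

HIDDEN = 768          # transformer hidden size H
HEAD_HIDDEN = 1536    # two-layer MLP head hidden size
NUM_ANSWERS = 3129    # assumed VQAv2 answer vocabulary size (illustrative)

# Two-layer MLP head applied to the pooled multimodal representation.
downstream_head = nn.Sequential(
    nn.Linear(HIDDEN, HEAD_HIDDEN),
    nn.LayerNorm(HEAD_HIDDEN),
    nn.GELU(),
    nn.Linear(HEAD_HIDDEN, NUM_ANSWERS),
)

# AdamW with the base learning rate and weight decay quoted in the table.
params = list(downstream_head.parameters())  # backbone params would be included in practice
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-2)
```

In a real run the backbone parameters would share this optimizer, and a learning-rate schedule with warmup would typically be added on top.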
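The whole-word masking objective with mask probability 0.15 can likewise be sketched on BERT-style word pieces: all pieces of a sampled word are masked together rather than independently. The `##` continuation convention and the `whole_word_mask` helper below are assumptions for illustration; the paper does not provide this routine as pseudocode.

```python
# Hedged sketch of whole-word masking at probability 0.15.
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15

def whole_word_mask(wordpieces, mask_prob=MASK_PROB, rng=random):
    # Group word-piece indices into whole words ("##" marks a continuation piece).
    words, current = [], []
    for i, piece in enumerate(wordpieces):
        if piece.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)

    masked = list(wordpieces)
    labels = [None] * len(wordpieces)
    # Sample whole words; mask every piece of a selected word.
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = masked[i]   # original piece becomes the MLM target
                masked[i] = MASK_TOKEN
    return masked, labels

# Example: either both "play" and "##ing" are masked, or neither is.
tokens = ["a", "dog", "play", "##ing", "with", "a", "fris", "##bee"]
print(whole_word_mask(tokens))
```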