ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
Authors: Wonjae Kim, Bokyung Son, Ildoo Kim
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, we use VQAv2 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2018), and for retrieval, we use MSCOCO and Flickr30K (F30K) (Plummer et al., 2015) re-splited by Karpathy & Fei-Fei (2015). ... In Table 5, we perform various ablations. |
| Researcher Affiliation | Industry | Current affiliation: NAVER AI Lab, Seongnam, Gyeonggi, Republic of Korea. Kakao Enterprise, Seongnam, Gyeonggi, Republic of Korea; Kakao Brain, Seongnam, Gyeonggi, Republic of Korea. Correspondence to: Wonjae Kim <wonjae.kim@navercorp.com>. |
| Pseudocode | No | The paper describes the model architecture and training process in text and diagrams, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and pre-trained weights are available at https://github.com/dandelin/vilt. |
| Open Datasets | Yes | We use four datasets for pre-training: Microsoft COCO (MSCOCO) (Lin et al., 2014), Visual Genome (VG) (Krishna et al., 2017), SBU Captions (SBU) (Ordonez et al., 2011), and Google Conceptual Captions (GCC) (Sharma et al., 2018). ... We evaluate ViLT on two widely explored types of vision-and-language downstream tasks: for classification, we use VQAv2 (Goyal et al., 2017) and NLVR2 (Suhr et al., 2018), and for retrieval, we use MSCOCO and Flickr30K (F30K) (Plummer et al., 2015) re-splited by Karpathy & Fei-Fei (2015). |
| Dataset Splits | Yes | For the classification tasks, we fine-tune three times with different initialization seeds for the head and data ordering and report the mean scores. ... For the retrieval tasks, we only fine-tune once. ... We fine-tune ViLT-B/32 on the Karpathy & Fei-Fei (2015) split of MSCOCO and F30K. ... reserving 1,000 validation images and their related questions for internal validation. |
| Hardware Specification | Yes | We pre-train ViLT-B/32 for 100K or 200K steps on 64 NVIDIA V100 GPUs... The latency is averaged over 10K times on a Xeon E5-2650 CPU and an NVIDIA P40 GPU. |
| Software Dependencies | No | The paper mentions using specific models like pre-trained BERT and ViT, and an AdamW optimizer, but it does not specify version numbers for general software dependencies or libraries used in the implementation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For all experiments, we use AdamW optimizer (Loshchilov & Hutter, 2018) with base learning rate of 10⁻⁴ and weight decay of 10⁻². ... We pre-train ViLT-B/32 for 100K or 200K steps... with a batch size of 4,096. For all downstream tasks, we train for ten epochs with a batch size of 256 for VQAv2/retrieval tasks and 128 for NLVR2. ... Hidden size H is 768, layer depth D is 12, patch size P is 32, MLP size is 3,072, and the number of attention heads is 12. ... We use a two-layer MLP of hidden size 1,536 as the fine-tuned downstream head. ... We apply RandAugment (Cubuk et al., 2020) during fine-tuning. ... We use N = 2, M = 9 as the hyperparameters. ... We mask whole words with a mask probability of 0.15 during pre-training. (Hedged sketches of the optimizer/head setup and the whole-word masking step follow this table.) |
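The fine-tuning setup reported above maps directly onto standard PyTorch. Below is a minimal sketch, assuming a pooled 768-dimensional backbone output and a VQAv2-style answer vocabulary of 3,129 classes; `downstream_head`, `NUM_ANSWERS`, and the head's LayerNorm/GELU layout are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two-layer MLP downstream head and AdamW settings
# described in the Experiment Setup row (lr 1e-4, weight decay 1e-2).
import torch
import torch.nn as nn

HIDDEN = 768          # transformer hidden size H
HEAD_HIDDEN = 1536    # two-layer MLP head hidden size
NUM_ANSWERS = 3129    # assumed VQAv2 answer vocabulary size (illustrative)

# Two-layer MLP head applied to the pooled multimodal representation.
downstream_head = nn.Sequential(
    nn.Linear(HIDDEN, HEAD_HIDDEN),
    nn.LayerNorm(HEAD_HIDDEN),
    nn.GELU(),
    nn.Linear(HEAD_HIDDEN, NUM_ANSWERS),
)

# AdamW with the base learning rate and weight decay quoted in the table.
params = list(downstream_head.parameters())  # backbone params would be included in practice
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-2)
```

In a real run the backbone parameters would share this optimizer, and a learning-rate schedule with warmup would typically be added on top.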
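The whole-word masking objective with mask probability 0.15 can likewise be sketched on BERT-style word pieces: all pieces of a sampled word are masked together rather than independently. The `##` continuation convention and the `whole_word_mask` helper below are assumptions for illustration; the paper does not provide this routine as pseudocode.

```python
# Hedged sketch of whole-word masking at probability 0.15.
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15

def whole_word_mask(wordpieces, mask_prob=MASK_PROB, rng=random):
    # Group word-piece indices into whole words ("##" marks a continuation piece).
    words, current = [], []
    for i, piece in enumerate(wordpieces):
        if piece.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)

    masked = list(wordpieces)
    labels = [None] * len(wordpieces)
    # Sample whole words; mask every piece of a selected word.
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = masked[i]   # original piece becomes the MLM target
                masked[i] = MASK_TOKEN
    return masked, labels

# Example: either both "play" and "##ing" are masked, or neither is.
tokens = ["a", "dog", "play", "##ing", "with", "a", "fris", "##bee"]
print(whole_word_mask(tokens))
```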