Efficient Training of Visual Transformers with Small Datasets

Authors: Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, Marco De Nadai

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we empirically analyse different VTs, comparing their robustness in a small training set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose an auxiliary self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data is scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. (A minimal sketch of this auxiliary task appears after the table below.)
Researcher Affiliation | Collaboration | Yahui Liu (University of Trento & Fondazione Bruno Kessler, yahui.liu@unitn.it); Enver Sangineto (University of Trento, enver.sangineto@unitn.it); Wei Bi (Tencent AI Lab, victoriabi@tencent.com); Nicu Sebe (University of Trento, niculae.sebe@unitn.it); Bruno Lepri (Fondazione Bruno Kessler, lepri@fbk.eu); Marco De Nadai (Fondazione Bruno Kessler, work@marcodena.it)
Pseudocode | No | The paper describes methods and processes in text and with diagrams, but it does not contain any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at: https://github.com/yhlleo/VTs-Drloc.
Open Datasets | Yes | We use 11 different datasets: ImageNet-100 (IN-100) [52, 56], which is a subset of 100 classes of ImageNet-1K [48]; CIFAR-10 and CIFAR-100 [31], Oxford Flowers-102 [41] and SVHN [40], which are four widely used computer vision datasets; and the six datasets of DomainNet [44], a benchmark commonly used for domain adaptation tasks. We chose the latter because of the large domain shift between some of its datasets and ImageNet-1K, which makes the fine-tuning experiments non-trivial. Tab. 1 shows the size of each dataset.
Dataset Splits | No | Table 1 lists 'Train size' and 'Test size' for each dataset, but does not report a separate validation split, nor does the paper describe how a validation set was created if one was used (e.g., held out from the training set).
Hardware Specification | Yes | We train each model using 8 V100 32GB GPUs.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with their version numbers (e.g., Python version, PyTorch version, CUDA version).
Experiment Setup | Yes | Using the results of Tab. 2 (a) (based on CIFAR-100 and Swin), we chose m = 64 for all the VTs and all the datasets. Moreover, Tab. 2 (b) shows the influence of the loss weight λ (Section 4) for each of the three baselines, which motivates our choice of using λ = 0.1 for both CvT and T2T and λ = 0.5 for Swin. These values of m and λ are kept fixed in all the other experiments of this paper, independently of the dataset, the main task (e.g., classification, detection, segmentation, etc.), and the training protocol (from scratch or fine-tuning). We train each model using 8 V100 32GB GPUs.
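
The auxiliary task quoted in the Research Type row is the paper's dense relative localization loss (the "Drloc" in the repository name): pairs of token embeddings from the VT's final feature grid are sampled at random, and a small MLP regresses their normalized relative offset. Below is a minimal PyTorch sketch of that idea; the names (`DRLocHead`, `drloc_loss`) and the MLP sizing are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class DRLocHead(nn.Module):
    """Illustrative head: predicts the relative (dx, dy) offset of two token embeddings."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
        )

    def forward(self, e_i: torch.Tensor, e_j: torch.Tensor) -> torch.Tensor:
        # Concatenate the two embeddings and regress their relative offset.
        return self.mlp(torch.cat([e_i, e_j], dim=-1))

def drloc_loss(grid: torch.Tensor, head: DRLocHead, m: int = 64) -> torch.Tensor:
    """grid: (B, H, W, C) final token embeddings of the VT.
    Samples m random token pairs per image and penalizes the L1 error
    between predicted and true normalized relative positions."""
    B, H, W, C = grid.shape
    dev = grid.device
    # Sample m (row, col) positions twice, independently, per image.
    yi = torch.randint(H, (B, m), device=dev)
    xi = torch.randint(W, (B, m), device=dev)
    yj = torch.randint(H, (B, m), device=dev)
    xj = torch.randint(W, (B, m), device=dev)
    batch = torch.arange(B, device=dev).unsqueeze(1).expand(B, m)
    e_i = grid[batch, yi, xi]                      # (B, m, C)
    e_j = grid[batch, yj, xj]                      # (B, m, C)
    pred = head(e_i, e_j)                          # (B, m, 2)
    # Ground-truth offsets, normalized by the grid size.
    target = torch.stack([(xj - xi) / W, (yj - yi) / H], dim=-1).float()
    return (pred - target).abs().mean()
```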
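
Putting the reported hyper-parameters together (m = 64 pairs for all VTs; λ = 0.1 for CvT and T2T, λ = 0.5 for Swin), the total objective is the standard supervised loss plus the weighted auxiliary term. A hedged sketch of one training step follows, assuming `model` returns both the classification logits and the final (B, H, W, C) token grid; that return convention is an assumption made for illustration.

```python
import torch.nn.functional as F

# Loss weights reported in the paper; m = 64 is shared by all VTs.
LAMBDA = {"cvt": 0.1, "t2t": 0.1, "swin": 0.5}

def training_step(model, drloc_head, images, labels, arch="swin", m=64):
    logits, token_grid = model(images)      # assumed return signature
    loss = F.cross_entropy(logits, labels)  # standard supervised task
    # Add the auxiliary dense relative localization term.
    loss = loss + LAMBDA[arch] * drloc_loss(token_grid, drloc_head, m=m)
    return loss
```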