Efficient Training of Visual Transformers with Small Datasets

Authors: Yahui Liu, Enver Sangineto, Wei Bi, Nicu Sebe, Bruno Lepri, Marco De Nadai

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we empirically analyse different VTs, comparing their robustness in a small training set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose an auxiliary self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data is scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged into existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. (A minimal sketch of this auxiliary task appears after the table below.)
Researcher Affiliation | Collaboration | Yahui Liu (University of Trento & Fondazione Bruno Kessler, yahui.liu@unitn.it); Enver Sangineto (University of Trento, enver.sangineto@unitn.it); Wei Bi (Tencent AI Lab, victoriabi@tencent.com); Nicu Sebe (University of Trento, niculae.sebe@unitn.it); Bruno Lepri (Fondazione Bruno Kessler, lepri@fbk.eu); Marco De Nadai (Fondazione Bruno Kessler, work@marcodena.it)
Pseudocode | No | The paper describes methods and processes in text and with diagrams, but it does not contain any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at: https://github.com/yhlleo/VTs-Drloc.
Open Datasets | Yes | We use 11 different datasets: ImageNet-100 (IN-100) [52, 56], which is a subset of 100 classes of ImageNet-1K [48]; CIFAR-10 and CIFAR-100 [31], Oxford Flowers-102 [41] and SVHN [40], which are four widely used computer vision datasets; and the six datasets of DomainNet [44], a benchmark commonly used for domain adaptation tasks. We chose the latter because of the large domain shift between some of its datasets and ImageNet-1K, which makes the fine-tuning experiments non-trivial. Tab. 1 shows the size of each dataset.
Dataset Splits | No | Table 1 lists 'Train size' and 'Test size' for each dataset, but does not report a separate validation split, nor does the paper describe how a validation set was created if one was used (e.g., held out from the training set).
Hardware Specification | Yes | We train each model using 8 V100 32GB GPUs.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with their version numbers (e.g., Python version, PyTorch version, CUDA version).
Experiment Setup | Yes | Using the results of Tab. 2 (a) (based on CIFAR-100 and Swin), we chose m = 64 for all the VTs and all the datasets. Moreover, Tab. 2 (b) shows the influence of the loss weight λ (Section 4) for each of the three baselines, which motivates our choice of using λ = 0.1 for both CvT and T2T and λ = 0.5 for Swin. These values of m and λ are kept fixed in all the other experiments of this paper, independently of the dataset, the main task (e.g., classification, detection, segmentation, etc.), and the training protocol (from scratch or fine-tuning). We train each model using 8 V100 32GB GPUs.
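
The auxiliary task quoted in the Research Type row is the paper's dense relative localization loss (the "Drloc" in the repository name): pairs of token embeddings from the VT's final feature grid are sampled at random, and a small MLP regresses their normalized relative offset. Below is a minimal PyTorch sketch of that idea; the names (`DRLocHead`, `drloc_loss`) and the MLP sizing are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class DRLocHead(nn.Module):
    """Illustrative head: predicts the relative (dx, dy) offset of two token embeddings."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),
        )

    def forward(self, e_i: torch.Tensor, e_j: torch.Tensor) -> torch.Tensor:
        # Concatenate the two embeddings and regress their relative offset.
        return self.mlp(torch.cat([e_i, e_j], dim=-1))

def drloc_loss(grid: torch.Tensor, head: DRLocHead, m: int = 64) -> torch.Tensor:
    """grid: (B, H, W, C) final token embeddings of the VT.
    Samples m random token pairs per image and penalizes the L1 error
    between predicted and true normalized relative positions."""
    B, H, W, C = grid.shape
    dev = grid.device
    # Sample m (row, col) positions twice, independently, per image.
    yi = torch.randint(H, (B, m), device=dev)
    xi = torch.randint(W, (B, m), device=dev)
    yj = torch.randint(H, (B, m), device=dev)
    xj = torch.randint(W, (B, m), device=dev)
    batch = torch.arange(B, device=dev).unsqueeze(1).expand(B, m)
    e_i = grid[batch, yi, xi]                      # (B, m, C)
    e_j = grid[batch, yj, xj]                      # (B, m, C)
    pred = head(e_i, e_j)                          # (B, m, 2)
    # Ground-truth offsets, normalized by the grid size.
    target = torch.stack([(xj - xi) / W, (yj - yi) / H], dim=-1).float()
    return (pred - target).abs().mean()
```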
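
Putting the reported hyper-parameters together (m = 64 pairs for all VTs; λ = 0.1 for CvT and T2T, λ = 0.5 for Swin), the total objective is the standard supervised loss plus the weighted auxiliary term. A hedged sketch of one training step follows, assuming `model` returns both the classification logits and the final (B, H, W, C) token grid; that return convention is an assumption made for illustration.

```python
import torch.nn.functional as F

# Loss weights reported in the paper; m = 64 is shared by all VTs.
LAMBDA = {"cvt": 0.1, "t2t": 0.1, "swin": 0.5}

def training_step(model, drloc_head, images, labels, arch="swin", m=64):
    logits, token_grid = model(images)      # assumed return signature
    loss = F.cross_entropy(logits, labels)  # standard supervised task
    # Add the auxiliary dense relative localization term.
    loss = loss + LAMBDA[arch] * drloc_loss(token_grid, drloc_head, m=m)
    return loss
```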