Training data-efficient image transformers & distillation through attention

Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks." (A sketch of this distillation objective appears after the table.)
Researcher Affiliation | Collaboration | "1 Facebook AI, 2 Sorbonne University"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We will share our code and models."
Open Datasets | Yes | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days." and "It uses Imagenet as the sole training set."
Dataset Splits | Yes | "ImageNet validation set" (Section 5.3). The paper uses standard benchmarks like ImageNet-1k, which have well-defined, commonly used validation sets. (A data-loading sketch appears after the table.)
Hardware Specification | Yes | "The throughput is measured as the number of images processed per second on a V100 GPU.", "on a single 8-GPU node", and "one 16GB V100 GPU." (A throughput-measurement sketch appears after the table.)
Software Dependencies | No | "We build upon PyTorch (Paszke et al., 2019) and the timm library (Wightman, 2019)." (Does not provide specific version numbers for these software components.)
Experiment Setup | Yes | "Table 9 indicates the hyper-parameters that we use by default at training time for all our experiments, unless stated otherwise. For distillation we follow the recommendations from Cho & Hariharan (2019) to select the parameters τ and λ. We take the typical values τ = 3.0 or τ = 1.0 and λ = 0.1 for the usual (soft) distillation. We scale the learning rate according to the batch size with the formula: lr_scaled = (lr / 512) × batchsize, similarly to Goyal et al. (2017), except that we use 512 instead of 256 as the base value." (A learning-rate scaling helper appears after the table.)
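
The distillation strategy quoted above combines the usual cross-entropy on the true labels with a distillation term at temperature τ, weighted by λ. The sketch below is a minimal PyTorch rendition of that soft-distillation objective under the stated defaults; the function name and argument defaults are illustrative rather than the authors' released code, and the architectural distillation token itself is not shown.

```python
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, targets, tau=3.0, lam=0.1):
    """(1 - lam) * cross-entropy on the true labels
    + lam * tau^2 * KL divergence between the temperature-softened teacher
    and student distributions (Hinton-style soft distillation).
    `student_logits` would come from the transformer's distillation head and
    `teacher_logits` from the (typically convnet) teacher run in eval mode.
    """
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    return (1.0 - lam) * ce + lam * (tau ** 2) * kl
```

In the paper's hard-distillation variant, the KL term is replaced by a cross-entropy against the teacher's argmax prediction.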
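
The dataset rows refer to ImageNet-1k and its standard validation split. As a small illustration (not the paper's data pipeline, which builds on the timm library), the splits could be loaded with torchvision as follows; the dataset path is a placeholder and the transforms are deliberately minimal compared with the paper's augmentation recipe.

```python
from torchvision import datasets, transforms

# Minimal preprocessing at 224x224; the paper's actual recipe adds strong
# augmentation (RandAugment, Mixup, CutMix, random erasing, ...).
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# "/path/to/imagenet" is a placeholder; torchvision expects the standard
# ImageNet-1k archives / directory layout.
train_set = datasets.ImageNet("/path/to/imagenet", split="train", transform=transform)
val_set = datasets.ImageNet("/path/to/imagenet", split="val", transform=transform)
```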
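
The throughput figure quoted in the hardware row (images per second on a V100 GPU) can be approximated for any PyTorch model with a routine like the one below; the batch size, resolution, precision, and warm-up schedule are assumptions, so absolute numbers will not match the paper exactly.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=64, image_size=224, warmup=10, iters=30, device="cuda"):
    """Rough images-per-second estimate on a single GPU using random inputs."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(warmup):      # let cuDNN pick algorithms, warm caches
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()     # wait for all queued GPU work to finish
    return batch_size * iters / (time.time() - start)
```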
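
Finally, the learning-rate rule in the experiment-setup row is a linear scaling with a base batch size of 512. A one-line helper makes the arithmetic explicit; the function name is illustrative, and the example base learning rate of 5e-4 is taken from the paper's Table 9 defaults.

```python
def scale_learning_rate(base_lr: float, batch_size: int, base_batch: int = 512) -> float:
    """lr_scaled = base_lr * batch_size / base_batch, with 512 as the base
    batch size (rather than 256 as in Goyal et al., 2017)."""
    return base_lr * batch_size / base_batch

# e.g. scale_learning_rate(5e-4, 1024) -> 0.001
```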