Training data-efficient image transformers & distillation through attention

Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks." (A sketch of this distillation objective appears after the table.)
Researcher Affiliation | Collaboration | "1 Facebook AI, 2 Sorbonne University"
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We will share our code and models."
Open Datasets | Yes | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days." and "It uses Imagenet as the sole training set."
Dataset Splits | Yes | "ImageNet validation set" (Section 5.3). The paper uses standard benchmarks like ImageNet-1k, which have well-defined, commonly used validation sets. (A data-loading sketch appears after the table.)
Hardware Specification | Yes | "The throughput is measured as the number of images processed per second on a V100 GPU.", "on a single 8-GPU node", and "one 16GB V100 GPU." (A throughput-measurement sketch appears after the table.)
Software Dependencies | No | "We build upon PyTorch (Paszke et al., 2019) and the timm library (Wightman, 2019)." (Does not provide specific version numbers for these software components.)
Experiment Setup | Yes | "Table 9 indicates the hyper-parameters that we use by default at training time for all our experiments, unless stated otherwise. For distillation we follow the recommendations from Cho & Hariharan (2019) to select the parameters τ and λ. We take the typical values τ = 3.0 or τ = 1.0 and λ = 0.1 for the usual (soft) distillation. We scale the learning rate according to the batch size with the formula: lr_scaled = (lr / 512) × batchsize, similarly to Goyal et al. (2017), except that we use 512 instead of 256 as the base value." (A learning-rate scaling helper appears after the table.)
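
The distillation strategy quoted above combines the usual cross-entropy on the true labels with a distillation term at temperature τ, weighted by λ. The sketch below is a minimal PyTorch rendition of that soft-distillation objective under the stated defaults; the function name and argument defaults are illustrative rather than the authors' released code, and the architectural distillation token itself is not shown.

```python
import torch.nn.functional as F

def soft_distillation_loss(student_logits, teacher_logits, targets, tau=3.0, lam=0.1):
    """(1 - lam) * cross-entropy on the true labels
    + lam * tau^2 * KL divergence between the temperature-softened teacher
    and student distributions (Hinton-style soft distillation).
    `student_logits` would come from the transformer's distillation head and
    `teacher_logits` from the (typically convnet) teacher run in eval mode.
    """
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.log_softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
        log_target=True,
    )
    return (1.0 - lam) * ce + lam * (tau ** 2) * kl
```

In the paper's hard-distillation variant, the KL term is replaced by a cross-entropy against the teacher's argmax prediction.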
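
The dataset rows refer to ImageNet-1k and its standard validation split. As a small illustration (not the paper's data pipeline, which builds on the timm library), the splits could be loaded with torchvision as follows; the dataset path is a placeholder and the transforms are deliberately minimal compared with the paper's augmentation recipe.

```python
from torchvision import datasets, transforms

# Minimal preprocessing at 224x224; the paper's actual recipe adds strong
# augmentation (RandAugment, Mixup, CutMix, random erasing, ...).
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# "/path/to/imagenet" is a placeholder; torchvision expects the standard
# ImageNet-1k archives / directory layout.
train_set = datasets.ImageNet("/path/to/imagenet", split="train", transform=transform)
val_set = datasets.ImageNet("/path/to/imagenet", split="val", transform=transform)
```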
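
The throughput figure quoted in the hardware row (images per second on a V100 GPU) can be approximated for any PyTorch model with a routine like the one below; the batch size, resolution, precision, and warm-up schedule are assumptions, so absolute numbers will not match the paper exactly.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=64, image_size=224, warmup=10, iters=30, device="cuda"):
    """Rough images-per-second estimate on a single GPU using random inputs."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(warmup):      # let cuDNN pick algorithms, warm caches
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()     # wait for all queued GPU work to finish
    return batch_size * iters / (time.time() - start)
```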
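
Finally, the learning-rate rule in the experiment-setup row is a linear scaling with a base batch size of 512. A one-line helper makes the arithmetic explicit; the function name is illustrative, and the example base learning rate of 5e-4 is taken from the paper's Table 9 defaults.

```python
def scale_learning_rate(base_lr: float, batch_size: int, base_batch: int = 512) -> float:
    """lr_scaled = base_lr * batch_size / base_batch, with 512 as the base
    batch size (rather than 256 as in Goyal et al., 2017)."""
    return base_lr * batch_size / base_batch

# e.g. scale_learning_rate(5e-4, 1024) -> 0.001
```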