Training data-efficient image transformers & distillation through attention
Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Herve Jegou
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks." (A hedged code sketch of this distillation-through-attention objective follows the table.) |
| Researcher Affiliation | Collaboration | Facebook AI; Sorbonne University. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will share our code and models. |
| Open Datasets | Yes | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days." and "It uses ImageNet as the sole training set." |
| Dataset Splits | Yes | "ImageNet validation set" (Section 5.3). The paper uses standard benchmarks such as ImageNet-1k, which have well-defined, commonly used validation sets. |
| Hardware Specification | Yes | "The throughput is measured as the number of images processed per second on a V100 GPU.", "on a single 8-GPU node", and "one 16GB V100 GPU". |
| Software Dependencies | No | "We build upon PyTorch (Paszke et al., 2019) and the timm library (Wightman, 2019)." The paper does not give specific version numbers for these software components. |
| Experiment Setup | Yes | "Table 9 indicates the hyper-parameters that we use by default at training time for all our experiments, unless stated otherwise. For distillation we follow the recommendations from Cho & Hariharan (2019) to select the parameters τ and λ. We take the typical values τ = 3.0 or τ = 1.0 and λ = 0.1 for the usual (soft) distillation. We scale the learning rate according to the batch size with the formula: lr_scaled = (lr / 512) × batchsize, similarly to Goyal et al. (2017) except that we use 512 instead of 256 as the base value." (A small worked example of this scaling rule is given after the table.) |
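
The distillation strategy quoted under "Research Type" combines a class-token loss on the true labels with a distillation-token loss against a (typically convnet) teacher. The snippet below is a minimal PyTorch-style sketch of that objective, assuming illustrative function and tensor names and a `hard` switch between the hard-label and soft (KL with temperature τ) variants; it is not the authors' released implementation.

```python
# Hedged sketch of a DeiT-style distillation objective, not the authors' code.
# Function name, arguments, and the hard/soft switch are illustrative assumptions.
import torch
import torch.nn.functional as F


def deit_distillation_loss(cls_logits, dist_logits, teacher_logits, labels,
                           tau=3.0, lam=0.1, hard=False):
    """Combine the class-token loss with a distillation-token loss.

    cls_logits:     logits from the student's class token
    dist_logits:    logits from the student's distillation token
    teacher_logits: logits from the teacher (no gradient)
    tau, lam:       temperature and mixing weight, as quoted from the paper
    """
    # Standard cross-entropy on the class token against the true labels.
    ce = F.cross_entropy(cls_logits, labels)

    if hard:
        # Hard distillation: the distillation token predicts the teacher's argmax.
        teacher_labels = teacher_logits.argmax(dim=1)
        distill = F.cross_entropy(dist_logits, teacher_labels)
        return 0.5 * ce + 0.5 * distill

    # Soft distillation: temperature-scaled KL divergence, weighted by tau^2.
    distill = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=1),
        F.log_softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
        log_target=True,
    ) * (tau * tau)
    return (1.0 - lam) * ce + lam * distill


# Example usage with random logits for a 1000-class problem (batch of 8).
student_cls = torch.randn(8, 1000)
student_dist = torch.randn(8, 1000)
teacher = torch.randn(8, 1000).detach()
y = torch.randint(0, 1000, (8,))
loss = deit_distillation_loss(student_cls, student_dist, teacher, y, hard=True)
```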
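
For the learning-rate scaling rule quoted under "Experiment Setup", the short Python example below shows the arithmetic of lr_scaled = (lr / 512) × batchsize; the base learning rate and batch size used here are illustrative values, not claims about the paper's exact configuration.

```python
def scale_lr(base_lr: float, batch_size: int, base_batch: int = 512) -> float:
    """Linear scaling rule from the quote: lr_scaled = base_lr / base_batch * batch_size."""
    return base_lr * batch_size / base_batch


# Illustrative values only: a base lr of 5e-4 with a global batch size of 1024
# scales to 1e-3 under the 512-based rule quoted above.
print(scale_lr(5e-4, 1024))  # 0.001
```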