Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Training data-efficient image transformers & distillation through attention
Authors: Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Herve Jegou
ICML 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. |
| Researcher Affiliation | Collaboration | ¹Facebook AI; ²Sorbonne University. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will share our code and models. |
| Open Datasets | Yes | "In this work, we produce competitive convolution-free transformers trained on ImageNet only using a single computer in less than 3 days." and "It uses Imagenet as the sole training set." |
| Dataset Splits | Yes | "ImageNet validation set" (Section 5.3). The paper uses standard benchmarks like ImageNet-1k, which have well-defined, commonly used validation sets. |
| Hardware Specification | Yes | "The throughput is measured as the number of images processed per second on a V100 GPU." and "on a single 8GPU node" and "one 16GB V100 GPU." |
| Software Dependencies | No | "We build upon PyTorch (Paszke et al., 2019) and the timm library (Wightman, 2019)." (The paper does not provide specific version numbers for these software components.) |
| Experiment Setup | Yes | Table 9 indicates the hyper-parameters that we use by default at training time for all our experiments, unless stated otherwise. For distillation we follow the recommendations from Cho & Hariharan (2019) to select the parameters τ and λ. We take the typical values τ = 3.0 or τ = 1.0 and λ = 0.1 for the usual (soft) distillation. We scale the learning rate according to the batch size with the formula: $\mathrm{lr}_{\mathrm{scaled}} = \frac{\mathrm{lr}}{512} \times \mathrm{batchsize}$, similarly to Goyal et al. (2017) except that we use 512 instead of 256 as the base value. |
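
The Experiment Setup entry quotes a learning-rate scaling rule and the soft-distillation parameters τ and λ. The following is a minimal sketch of how those quantities might be computed, not the authors' code. The function names are hypothetical, the base learning rate in the usage note is an illustrative value, and the loss form (cross-entropy weighted by 1 − λ plus a τ²-scaled KL term between temperature-softened teacher and student distributions) is assumed from the standard soft-distillation formulation rather than taken from the text quoted above.

```python
import math


def scaled_lr(base_lr: float, batch_size: int, base_batch: int = 512) -> float:
    """Linear scaling rule quoted above: lr_scaled = lr / 512 * batchsize."""
    return base_lr / base_batch * batch_size


def softmax(logits, tau: float = 1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_distillation_loss(student_logits, teacher_logits, label: int,
                           tau: float = 3.0, lam: float = 0.1) -> float:
    """Assumed form: (1 - lam) * CE(student, label) + lam * tau^2 * KL(teacher || student)."""
    # Cross-entropy of the student against the ground-truth label (temperature 1).
    ce = -math.log(softmax(student_logits)[label])
    # KL divergence between the temperature-softened teacher and student outputs.
    p_teacher = softmax(teacher_logits, tau)
    p_student = softmax(student_logits, tau)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - lam) * ce + lam * tau ** 2 * kl
```

For example, with an illustrative base learning rate of 5e-4, `scaled_lr(5e-4, 1024)` doubles the rate because the batch size is twice the base value of 512. When the student and teacher logits are identical, the KL term vanishes and the loss reduces to the (1 − λ)-weighted cross-entropy.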