On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Authors: Alex Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
Researcher Affiliation | Collaboration | Alexander C. Li (Carnegie Mellon University); Yuandong Tian (FAIR); Beidi Chen (Carnegie Mellon University); Deepak Pathak (Carnegie Mellon University); Xinlei Chen (FAIR)
Pseudocode | No | The paper does not contain any sections explicitly labeled as 'Pseudocode' or 'Algorithm', nor are there any structured algorithm blocks presented in pseudocode format. (A hedged sketch of the attention-distillation objective is given after this table.)
Open Source Code | Yes | Code to reproduce our results is at https://github.com/alexlioralexli/attention-transfer.
Open Datasets | Yes | Compared to a ViT-L trained from scratch (with an accuracy score of 83.0), fine-tuning the MAE pre-trained on the same dataset results in a significant improvement to 85.7. ... ImageNet-1K classification [10].
Dataset Splits | No | The paper does not explicitly provide numerical details about training, validation, or test dataset splits (e.g., percentages or sample counts). It refers to standard datasets and benchmarks where splits are typically predefined, but does not specify them within the paper itself.
Hardware Specification | Yes | We compare fine-tuning vs. attention distillation on a 16GB NVIDIA GP100 with ViT-L and a batch size of 16.
Software Dependencies | No | The paper mentions several components such as the AdamW optimizer, RandAug, mixup, cutmix, and drop path, but does not specify their version numbers or the versions of overarching software environments such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Appendix C provides 'Implementation Details' with specific training recipes for Attention Copy (Table 15) and Attention Distillation (Table 16), including optimizer, base learning rate, weight decay, batch size, learning rate schedule, warmup epochs, training epochs, and augmentation.
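The table notes that the paper provides no pseudocode but does report training recipes for Attention Copy and Attention Distillation. For readers who want a concrete picture before consulting the released repository, below is a minimal PyTorch sketch of an attention-distillation-style objective: a frozen pre-trained teacher ViT supplies per-layer attention maps, and the student is trained on the supervised loss plus a term that pulls its attention maps toward the teacher's. The `return_attention=True` interface, the choice of KL divergence as the matching loss, and the weight `lambda_attn` are illustrative assumptions, not the authors' implementation (see https://github.com/alexlioralexli/attention-transfer for that).

```python
# Hypothetical sketch of an attention-distillation-style objective for ViTs.
# Assumes both models can return per-layer attention maps of shape
# [batch, heads, tokens, tokens]; the authors' actual code is in their repository.
import torch
import torch.nn.functional as F


def attention_distillation_loss(student_attns, teacher_attns, eps=1e-8):
    """KL divergence between teacher and student attention maps, averaged over layers.

    Each attention row is already a probability distribution (post-softmax),
    so it can be compared directly with KL(teacher || student).
    """
    loss = 0.0
    for s_attn, t_attn in zip(student_attns, teacher_attns):
        loss = loss + F.kl_div((s_attn + eps).log(), t_attn, reduction="batchmean")
    return loss / len(student_attns)


def training_step(student, teacher, images, labels, lambda_attn=1.0):
    """One optimization step: supervised classification loss + attention matching."""
    with torch.no_grad():
        # Teacher is the frozen pre-trained model (e.g., MAE); it only provides attention maps.
        _, teacher_attns = teacher(images, return_attention=True)
    logits, student_attns = student(images, return_attention=True)
    cls_loss = F.cross_entropy(logits, labels)
    attn_loss = attention_distillation_loss(student_attns, teacher_attns)
    return cls_loss + lambda_attn * attn_loss
```

Roughly speaking, Attention Copy differs in that the teacher's attention maps are plugged directly into the student's attention layers during training rather than matched through a loss term; both variants are sketched here only to illustrate the idea, not to reproduce the paper's exact recipe.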