On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Authors: Alex Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
Researcher Affiliation | Collaboration | Alexander C. Li (Carnegie Mellon University); Yuandong Tian (FAIR); Beidi Chen (Carnegie Mellon University); Deepak Pathak (Carnegie Mellon University); Xinlei Chen (FAIR)
Pseudocode | No | The paper does not contain any sections explicitly labeled as 'Pseudocode' or 'Algorithm', nor are there any structured algorithm blocks presented in pseudocode format. (A hedged sketch of the attention-distillation objective is given after this table.)
Open Source Code | Yes | Code to reproduce our results is at https://github.com/alexlioralexli/attention-transfer.
Open Datasets | Yes | Compared to a ViT-L trained from scratch (with an accuracy score of 83.0), fine-tuning the MAE pre-trained on the same dataset results in a significant improvement to 85.7. ... ImageNet-1K classification [10].
Dataset Splits | No | The paper does not explicitly provide numerical details about training, validation, or test dataset splits (e.g., percentages or sample counts). It refers to standard datasets and benchmarks where splits are typically predefined, but does not specify them within the paper itself.
Hardware Specification | Yes | We compare fine-tuning vs. attention distillation on a 16GB NVIDIA GP100 with ViT-L and a batch size of 16.
Software Dependencies | No | The paper mentions several components such as the AdamW optimizer, RandAug, mixup, cutmix, and drop path, but does not specify their version numbers or the versions of overarching software environments such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | Appendix C provides 'Implementation Details' with specific training recipes for Attention Copy (Table 15) and Attention Distillation (Table 16), including optimizer, base learning rate, weight decay, batch size, learning rate schedule, warmup epochs, training epochs, and augmentation.
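The table notes that the paper provides no pseudocode but does report training recipes for Attention Copy and Attention Distillation. For readers who want a concrete picture before consulting the released repository, below is a minimal PyTorch sketch of an attention-distillation-style objective: a frozen pre-trained teacher ViT supplies per-layer attention maps, and the student is trained on the supervised loss plus a term that pulls its attention maps toward the teacher's. The `return_attention=True` interface, the choice of KL divergence as the matching loss, and the weight `lambda_attn` are illustrative assumptions, not the authors' implementation (see https://github.com/alexlioralexli/attention-transfer for that).

```python
# Hypothetical sketch of an attention-distillation-style objective for ViTs.
# Assumes both models can return per-layer attention maps of shape
# [batch, heads, tokens, tokens]; the authors' actual code is in their repository.
import torch
import torch.nn.functional as F


def attention_distillation_loss(student_attns, teacher_attns, eps=1e-8):
    """KL divergence between teacher and student attention maps, averaged over layers.

    Each attention row is already a probability distribution (post-softmax),
    so it can be compared directly with KL(teacher || student).
    """
    loss = 0.0
    for s_attn, t_attn in zip(student_attns, teacher_attns):
        loss = loss + F.kl_div((s_attn + eps).log(), t_attn, reduction="batchmean")
    return loss / len(student_attns)


def training_step(student, teacher, images, labels, lambda_attn=1.0):
    """One optimization step: supervised classification loss + attention matching."""
    with torch.no_grad():
        # Teacher is the frozen pre-trained model (e.g., MAE); it only provides attention maps.
        _, teacher_attns = teacher(images, return_attention=True)
    logits, student_attns = student(images, return_attention=True)
    cls_loss = F.cross_entropy(logits, labels)
    attn_loss = attention_distillation_loss(student_attns, teacher_attns)
    return cls_loss + lambda_attn * attn_loss
```

Roughly speaking, Attention Copy differs in that the teacher's attention maps are plugged directly into the student's attention layers during training rather than matched through a loss term; both variants are sketched here only to illustrate the idea, not to reproduce the paper's exact recipe.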