Do Vision Transformers See Like Convolutional Neural Networks?
Authors: Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer. |
| Researcher Affiliation | Industry | Maithra Raghu (Google Research, Brain Team; maithrar@gmail.com); Thomas Unterthiner (Google Research, Brain Team; unterthiner@google.com); Simon Kornblith (Google Research, Brain Team; kornblith@google.com); Chiyuan Zhang (Google Research, Brain Team; chiyuan@google.com); Alexey Dosovitskiy (Google Research, Brain Team; adosovitskiy@google.com) |
| Pseudocode | No | The paper describes various analyses and experimental procedures but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] |
| Open Datasets | Yes | Unless otherwise specified, models are trained on the JFT-300M dataset [40], although we also investigate models trained on the ImageNet ILSVRC 2012 dataset [12, 37] and standard transfer learning benchmarks [50, 14]. |
| Dataset Splits | Yes | The paper states: 'Unless otherwise specified, models are trained on the JFT-300M dataset [40], although we also investigate models trained on the ImageNet ILSVRC 2012 dataset [12, 37] and standard transfer learning benchmarks [50, 14].' These are well-established benchmarks with predefined training, validation, and test splits, implying those splits were used in the experiments. |
| Hardware Specification | No | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] |
| Software Dependencies | No | The paper mentions using specific models (e.g., ViT-B/32, ResNet50) and analysis techniques (e.g., CKA; a sketch of this metric follows the table), but it does not specify any software names with version numbers (e.g., Python, TensorFlow, PyTorch, CUDA versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We provide further details of the experimental setting in Appendix A. ... 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] |
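For orientation on the CKA metric referenced in the Software Dependencies row: the paper's representation comparisons rest on Centered Kernel Alignment. Below is a minimal NumPy sketch of the linear-kernel form, CKA(X, Y) = ||YᵀX||²_F / (||XᵀX||_F ||YᵀY||_F) on mean-centered activations. The function name `linear_cka` and the unbatched formulation are illustrative assumptions; the paper computes CKA with a minibatch estimator, and this is not the authors' released code.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape
    (n_examples, n_features); feature counts may differ."""
    # Center each feature across examples, as CKA requires.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 measures cross-similarity of the two spaces.
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalize by each representation's self-similarity.
    denominator = (np.linalg.norm(x.T @ x, ord="fro")
                   * np.linalg.norm(y.T @ y, ord="fro"))
    return numerator / denominator

# Hypothetical usage: compare two layers' activations on 512 inputs.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(512, 768))   # e.g., a ViT block output
acts_b = rng.normal(size=(512, 1024))  # e.g., a ResNet stage output
print(linear_cka(acts_a, acts_b))      # 1.0 = identical structure
```

Because the ratio is invariant to orthogonal transformations and isotropic scaling of either representation, scores are comparable across layers of different widths, which is what lets the paper contrast ViT and ResNet layers directly.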