Do Vision Transformers See Like Convolutional Neural Networks?
Authors: Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, Alexey Dosovitskiy
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Analyzing the internal representation structure of ViTs and CNNs on image classification benchmarks, we find striking differences between the two architectures, such as ViT having more uniform representations across all layers. We explore how these differences arise, finding crucial roles played by self-attention, which enables early aggregation of global information, and ViT residual connections, which strongly propagate features from lower to higher layers. We study the ramifications for spatial localization, demonstrating ViTs successfully preserve input spatial information, with noticeable effects from different classification methods. Finally, we study the effect of (pretraining) dataset scale on intermediate features and transfer learning, and conclude with a discussion on connections to new architectures such as the MLP-Mixer. |
| Researcher Affiliation | Industry | Maithra Raghu (Google Research, Brain Team; maithrar@gmail.com); Thomas Unterthiner (Google Research, Brain Team; unterthiner@google.com); Simon Kornblith (Google Research, Brain Team; kornblith@google.com); Chiyuan Zhang (Google Research, Brain Team; chiyuan@google.com); Alexey Dosovitskiy (Google Research, Brain Team; adosovitskiy@google.com) |
| Pseudocode | No | The paper describes various analyses and experimental procedures but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] |
| Open Datasets | Yes | Unless otherwise specified, models are trained on the JFT-300M dataset [40], although we also investigate models trained on the ImageNet ILSVRC 2012 dataset [12, 37] and standard transfer learning benchmarks [50, 14]. |
| Dataset Splits | Yes | The paper states: 'Unless otherwise specified, models are trained on the JFT-300M dataset [40], although we also investigate models trained on the ImageNet ILSVRC 2012 dataset [12, 37] and standard transfer learning benchmarks [50, 14].' These are well-established benchmarks with predefined training, validation, and test splits, implying those splits were used in the experiments. |
| Hardware Specification | No | 3. If you ran experiments... (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No] |
| Software Dependencies | No | The paper mentions using specific models (e.g., ViT-B/32, ResNet50) and analysis techniques (e.g., CKA; a sketch of this metric follows the table), but it does not specify any software names with version numbers (e.g., Python, TensorFlow, PyTorch, CUDA versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | We provide further details of the experimental setting in Appendix A. ... 3. If you ran experiments... (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] |
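For orientation on the CKA metric referenced in the Software Dependencies row: the paper's representation comparisons rest on Centered Kernel Alignment. Below is a minimal NumPy sketch of the linear-kernel form, CKA(X, Y) = ||YᵀX||²_F / (||XᵀX||_F ||YᵀY||_F) on mean-centered activations. The function name `linear_cka` and the unbatched formulation are illustrative assumptions; the paper computes CKA with a minibatch estimator, and this is not the authors' released code.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape
    (n_examples, n_features); feature counts may differ."""
    # Center each feature across examples, as CKA requires.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 measures cross-similarity of the two spaces.
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalize by each representation's self-similarity.
    denominator = (np.linalg.norm(x.T @ x, ord="fro")
                   * np.linalg.norm(y.T @ y, ord="fro"))
    return numerator / denominator

# Hypothetical usage: compare two layers' activations on 512 inputs.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(512, 768))   # e.g., a ViT block output
acts_b = rng.normal(size=(512, 1024))  # e.g., a ResNet stage output
print(linear_cka(acts_a, acts_b))      # 1.0 = identical structure
```

Because the ratio is invariant to orthogonal transformations and isotropic scaling of either representation, scores are comparable across layers of different widths, which is what lets the paper contrast ViT and ResNet layers directly.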