Stand-Alone Self-Attention in Vision Models

Authors: Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50,000 test images. The procedure described in Section 3.1, replacing the spatial convolution layer inside each bottleneck block of a ResNet-50 [15] model with a self-attention layer, is used to create the attention model. [...] Table 1 and Figure 5 show the results of the full attention variant compared with the convolution baseline. (A hedged sketch of this bottleneck substitution appears after the table.)
Researcher Affiliation | Industry | Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens; Google Research, Brain Team ({prajit, nikip, avaswani}@google.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for this project is made available at https://github.com/google-research/google-research/tree/master/standalone_self_attention_in_vision_models
Open Datasets | Yes | We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50,000 test images. [...] We evaluate attention models on the COCO object detection task [56] using the RetinaNet architecture [18].
Dataset Splits | Yes | We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50,000 test images. [...] Accuracies computed on validation set.
Hardware Specification | No | The paper mentions 'modern hardware' and 'hardware accelerators' generally, but does not provide specific details such as GPU models, CPU models, or exact TPU versions used for the experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | The multi-head self-attention layer uses a spatial extent of k = 7 and 8 attention heads. The stem performs self-attention within each 4×4 spatial block of the original image, followed by batch normalization and a 4×4 max pool operation. Exact hyperparameters can be found in the appendix. (A hedged sketch of such an attention layer appears below.)
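To make the setup row concrete, here is a minimal PyTorch sketch of a local multi-head self-attention layer with spatial extent k = 7 and 8 heads, including the factorized relative row/column position embeddings the paper describes. The authors' release is TensorFlow, so the class and parameter names here (LocalSelfAttention2d, rel_h, rel_w) are illustrative assumptions, not the released API; the 1/sqrt(d) scaling is a common convention and may differ from the paper's exact normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Multi-head self-attention over a local k x k window around each pixel.

    Sketch of the layer described in the paper (spatial extent k = 7,
    8 heads) with factorized relative position embeddings; details are
    illustrative, not the authors' released implementation.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 7, heads: int = 8):
        super().__init__()
        assert out_ch % heads == 0 and k % 2 == 1
        self.k, self.heads, self.dh = k, heads, out_ch // heads
        assert self.dh % 2 == 0  # half the head dim encodes rows, half columns
        # 1x1 convolutions compute per-pixel queries, keys, and values.
        self.to_q = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.to_kv = nn.Conv2d(in_ch, 2 * out_ch, 1, bias=False)
        # Factorized relative-position embeddings (initialization is a guess).
        self.rel_h = nn.Parameter(torch.randn(self.dh // 2, k) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(self.dh // 2, k) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        kk = self.k * self.k
        # Gather each pixel's k x k neighborhood of keys and values.
        kv = F.unfold(self.to_kv(x), self.k, padding=self.k // 2)
        kv = kv.view(b, 2, self.heads, self.dh, kk, h * w)
        keys, values = kv[:, 0], kv[:, 1]       # (b, heads, dh, k*k, h*w)
        q = self.to_q(x).view(b, self.heads, self.dh, 1, h * w)
        # Relative-position term: rows use the first dh/2 channels, columns
        # the rest, so each logit combines content and position, matching
        # the paper's q . k + q . r formulation.
        rel = torch.cat([
            self.rel_h[:, :, None].expand(-1, -1, self.k),  # (dh/2, k, k)
            self.rel_w[:, None, :].expand(-1, self.k, -1),  # (dh/2, k, k)
        ]).reshape(1, 1, self.dh, kk, 1)
        logits = (q * (keys + rel)).sum(2, keepdim=True) / self.dh ** 0.5
        attn = logits.softmax(dim=3)            # softmax over the k*k window
        out = (attn * values).sum(3)            # (b, heads, dh, h*w)
        return out.reshape(b, self.heads * self.dh, h, w)

# Example: a 7x7-window, 8-head layer on a 56x56 feature map.
layer = LocalSelfAttention2d(64, 64, k=7, heads=8)
y = layer(torch.randn(2, 64, 56, 56))           # -> (2, 64, 56, 56)
```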
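And a sketch of the Section 3.1 substitution referenced in the Research Type row: the 3×3 spatial convolution inside a ResNet-50 bottleneck is swapped for the attention layer above. The 2×2 average pool (stride 2) after attention for spatial downsampling follows the paper; the exact norm/activation placement is a plausible reconstruction, and the block reuses LocalSelfAttention2d from the previous sketch.

```python
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """ResNet bottleneck with the 3x3 spatial convolution replaced by
    local self-attention (Section 3.1). A hedged reconstruction, not the
    released code; reuses LocalSelfAttention2d from the sketch above.
    """

    def __init__(self, in_ch, mid_ch, out_ch, stride=1, k=7, heads=8):
        super().__init__()
        self.reduce = nn.Sequential(            # 1x1 channel reduction
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # Stand-alone self-attention stands in for the 3x3 convolution.
        self.attn = LocalSelfAttention2d(mid_ch, mid_ch, k=k, heads=heads)
        # Downsampling: 2x2 average pool after attention, per the paper.
        self.pool = nn.AvgPool2d(2) if stride == 2 else nn.Identity()
        self.expand = nn.Sequential(            # 1x1 channel expansion
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when shape changes, as in a standard ResNet.
        self.shortcut = nn.Identity()
        if stride == 2 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.AvgPool2d(2) if stride == 2 else nn.Identity(),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        y = self.expand(self.pool(self.attn(self.reduce(x))))
        return nn.functional.relu(y + self.shortcut(x))
```

In the paper's full attention variant, every bottleneck block uses this substitution, while the raw image is handled by the separate attention stem described in the Experiment Setup row.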