Stand-Alone Self-Attention in Vision Models

Authors: Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50,000 test images. The procedure described in Section 3.1, replacing the spatial convolution layer inside each bottleneck block of a ResNet-50 [15] model with a self-attention layer, is used to create the attention model. [...] Table 1 and Figure 5 show the results of the full attention variant compared with the convolution baseline. (A hedged sketch of this bottleneck substitution appears after the table.)
Researcher Affiliation | Industry | Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens; Google Research, Brain Team ({prajit, nikip, avaswani}@google.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code for this project is made available at https://github.com/google-research/google-research/tree/master/standalone_self_attention_in_vision_models
Open Datasets | Yes | We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50,000 test images. [...] We evaluate attention models on the COCO object detection task [56] using the RetinaNet architecture [18].
Dataset Splits | Yes | We perform experiments on the ImageNet classification task [55], which contains 1.28 million training images and 50,000 test images. [...] Accuracies computed on validation set.
Hardware Specification | No | The paper mentions 'modern hardware' and 'hardware accelerators' generally, but does not provide specific details such as GPU models, CPU models, or exact TPU versions used for the experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | The multi-head self-attention layer uses a spatial extent of k = 7 and 8 attention heads. The stem performs self-attention within each 4×4 spatial block of the original image, followed by batch normalization and a 4×4 max pool operation. Exact hyperparameters can be found in the appendix. (A hedged sketch of such an attention layer appears below.)
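To make the setup row concrete, here is a minimal PyTorch sketch of a local multi-head self-attention layer with spatial extent k = 7 and 8 heads, including the factorized relative row/column position embeddings the paper describes. The authors' release is TensorFlow, so the class and parameter names here (LocalSelfAttention2d, rel_h, rel_w) are illustrative assumptions, not the released API; the 1/sqrt(d) scaling is a common convention and may differ from the paper's exact normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSelfAttention2d(nn.Module):
    """Multi-head self-attention over a local k x k window around each pixel.

    Sketch of the layer described in the paper (spatial extent k = 7,
    8 heads) with factorized relative position embeddings; details are
    illustrative, not the authors' released implementation.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 7, heads: int = 8):
        super().__init__()
        assert out_ch % heads == 0 and k % 2 == 1
        self.k, self.heads, self.dh = k, heads, out_ch // heads
        assert self.dh % 2 == 0  # half the head dim encodes rows, half columns
        # 1x1 convolutions compute per-pixel queries, keys, and values.
        self.to_q = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.to_kv = nn.Conv2d(in_ch, 2 * out_ch, 1, bias=False)
        # Factorized relative-position embeddings (initialization is a guess).
        self.rel_h = nn.Parameter(torch.randn(self.dh // 2, k) * 0.02)
        self.rel_w = nn.Parameter(torch.randn(self.dh // 2, k) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        kk = self.k * self.k
        # Gather each pixel's k x k neighborhood of keys and values.
        kv = F.unfold(self.to_kv(x), self.k, padding=self.k // 2)
        kv = kv.view(b, 2, self.heads, self.dh, kk, h * w)
        keys, values = kv[:, 0], kv[:, 1]       # (b, heads, dh, k*k, h*w)
        q = self.to_q(x).view(b, self.heads, self.dh, 1, h * w)
        # Relative-position term: rows use the first dh/2 channels, columns
        # the rest, so each logit combines content and position, matching
        # the paper's q . k + q . r formulation.
        rel = torch.cat([
            self.rel_h[:, :, None].expand(-1, -1, self.k),  # (dh/2, k, k)
            self.rel_w[:, None, :].expand(-1, self.k, -1),  # (dh/2, k, k)
        ]).reshape(1, 1, self.dh, kk, 1)
        logits = (q * (keys + rel)).sum(2, keepdim=True) / self.dh ** 0.5
        attn = logits.softmax(dim=3)            # softmax over the k*k window
        out = (attn * values).sum(3)            # (b, heads, dh, h*w)
        return out.reshape(b, self.heads * self.dh, h, w)

# Example: a 7x7-window, 8-head layer on a 56x56 feature map.
layer = LocalSelfAttention2d(64, 64, k=7, heads=8)
y = layer(torch.randn(2, 64, 56, 56))           # -> (2, 64, 56, 56)
```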
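And a sketch of the Section 3.1 substitution referenced in the Research Type row: the 3×3 spatial convolution inside a ResNet-50 bottleneck is swapped for the attention layer above. The 2×2 average pool (stride 2) after attention for spatial downsampling follows the paper; the exact norm/activation placement is a plausible reconstruction, and the block reuses LocalSelfAttention2d from the previous sketch.

```python
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """ResNet bottleneck with the 3x3 spatial convolution replaced by
    local self-attention (Section 3.1). A hedged reconstruction, not the
    released code; reuses LocalSelfAttention2d from the sketch above.
    """

    def __init__(self, in_ch, mid_ch, out_ch, stride=1, k=7, heads=8):
        super().__init__()
        self.reduce = nn.Sequential(            # 1x1 channel reduction
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        # Stand-alone self-attention stands in for the 3x3 convolution.
        self.attn = LocalSelfAttention2d(mid_ch, mid_ch, k=k, heads=heads)
        # Downsampling: 2x2 average pool after attention, per the paper.
        self.pool = nn.AvgPool2d(2) if stride == 2 else nn.Identity()
        self.expand = nn.Sequential(            # 1x1 channel expansion
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when shape changes, as in a standard ResNet.
        self.shortcut = nn.Identity()
        if stride == 2 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.AvgPool2d(2) if stride == 2 else nn.Identity(),
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        y = self.expand(self.pool(self.attn(self.reduce(x))))
        return nn.functional.relu(y + self.shortcut(x))
```

In the paper's full attention variant, every bottleneck block uses this substitution, while the raw image is handled by the separate attention stem described in the Experiment Setup row.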