Stand-Alone Self-Attention in Vision Models
Authors: Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments on ImageNet classification task [55] which contains 1.28 million training images and 50000 test images. The procedure described in Section 3.1 of replacing the spatial convolution layer with a self-attention layer from inside each bottleneck block of a ResNet-50 [15] model is used to create the attention model. [...] Table 1 and Figure 5 show the results of the full attention variant compared with the convolution baseline. (A structural sketch of this replacement appears after the table.) |
| Researcher Affiliation | Industry | Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens, Google Research, Brain Team, {prajit, nikip, avaswani}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for this project is made available: https://github.com/google-research/google-research/tree/master/standalone_self_attention_in_vision_models |
| Open Datasets | Yes | We perform experiments on ImageNet classification task [55] which contains 1.28 million training images and 50000 test images. [...] We evaluate attention models on the COCO object detection task [56] using the RetinaNet architecture [18]. |
| Dataset Splits | Yes | We perform experiments on ImageNet classification task [55] which contains 1.28 million training images and 50000 test images. [...] Accuracies computed on validation set. |
| Hardware Specification | No | The paper mentions 'modern hardware' and 'hardware accelerators' generally, but does not provide specific details such as GPU models, CPU models, or exact TPU versions used for their experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | The multi-head self-attention layer uses a spatial extent of k = 7 and 8 attention heads. The stem performs self-attention within each 4×4 spatial block of the original image, followed by batch normalization and a 4×4 max pool operation. Exact hyperparameters can be found in the appendix. (A minimal sketch of such a local self-attention layer appears below.) |
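
The paper's text does not contain code, so the following is a hypothetical structural sketch of the replacement procedure quoted in the Research Type row: the 3×3 spatial convolution inside each ResNet-50 bottleneck block is swapped for a stand-alone self-attention layer while the surrounding 1×1 convolutions are kept. The `LayerSpec` tuples, function names, and channel sizes are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: swap the spatial 3x3 convolution in a ResNet-50
# bottleneck block for a local self-attention layer (Section 3.1 of the paper).
# Layer "specs" are illustrative tuples, not the authors' code.
from typing import Dict, List, Tuple

LayerSpec = Tuple[str, Dict]  # (layer type, keyword arguments)

def conv_bottleneck(c_in: int, c_mid: int, c_out: int) -> List[LayerSpec]:
    """Standard ResNet-50 bottleneck: 1x1 conv -> 3x3 conv -> 1x1 conv."""
    return [
        ("conv1x1", {"in": c_in, "out": c_mid}),
        ("conv3x3", {"in": c_mid, "out": c_mid}),  # the spatial convolution
        ("conv1x1", {"in": c_mid, "out": c_out}),
    ]

def attention_bottleneck(c_in: int, c_mid: int, c_out: int,
                         k: int = 7, heads: int = 8) -> List[LayerSpec]:
    """Same block with only the spatial convolution replaced by attention."""
    block = conv_bottleneck(c_in, c_mid, c_out)
    block[1] = ("local_self_attention",
                {"in": c_mid, "out": c_mid, "spatial_extent": k, "heads": heads})
    return block

if __name__ == "__main__":
    # Example: a bottleneck block with 256 -> 64 -> 256 channels.
    for layer_type, kwargs in attention_bottleneck(256, 64, 256):
        print(layer_type, kwargs)
```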
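
The Experiment Setup row names a spatial extent of k = 7 and 8 attention heads. Below is a minimal NumPy sketch of such a local multi-head self-attention layer with relative row/column position embeddings, assuming a single unbatched feature map and a naive per-pixel loop; the projection shapes, initialization, and per-head split of the embeddings are assumptions for illustration and do not reproduce the released implementation.

```python
import numpy as np

def local_self_attention(x, w_q, w_k, w_v, rel_row, rel_col, k=7, heads=8):
    """x: (H, W, c_in) feature map. w_q/w_k/w_v: (c_in, c_out) 1x1 projections.
    rel_row/rel_col: (k, c_out // 2) relative position embeddings.
    Returns an (H, W, c_out) feature map."""
    H, W, _ = x.shape
    c_out = w_q.shape[1]
    d = c_out // heads                        # per-head channel dimension
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))

    q = x @ w_q                               # queries, one per pixel
    # Embedding for each of the k*k offsets: concatenate a row-offset part
    # and a column-offset part (flat index a*k + b <-> offsets (a, b)).
    rel = np.concatenate([np.repeat(rel_row, k, axis=0),
                          np.tile(rel_col, (k, 1))], axis=1)   # (k*k, c_out)

    out = np.zeros((H, W, c_out))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k].reshape(k * k, -1)    # local memory block
            keys = patch @ w_k
            vals = patch @ w_v
            for h in range(heads):
                sl = slice(h * d, (h + 1) * d)
                # logits = q^T k + q^T r over the k*k neighbourhood
                logits = (keys[:, sl] + rel[:, sl]) @ q[i, j, sl]
                attn = np.exp(logits - logits.max())
                attn /= attn.sum()
                out[i, j, sl] = attn @ vals[:, sl]
    return out

# Usage on random data (shapes chosen arbitrarily for the example):
rng = np.random.default_rng(0)
H, W, c_in, c_out, k = 14, 14, 64, 64, 7
proj = lambda: 0.05 * rng.standard_normal((c_in, c_out))
emb = lambda: 0.05 * rng.standard_normal((k, c_out // 2))
y = local_self_attention(rng.standard_normal((H, W, c_in)),
                         proj(), proj(), proj(), emb(), emb())
assert y.shape == (H, W, c_out)
```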