On the Relationship between Self-Attention and Convolutional Layers
Authors: Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. (An illustrative sketch of this attention-as-convolution construction follows the table.) |
| Researcher Affiliation | Academia | Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi, École Polytechnique Fédérale de Lausanne (EPFL), {first.last}@epfl.ch |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The proof sections contain mathematical derivations. |
| Open Source Code | Yes | Our code is publicly available. Code: github.com/epfml/attention-cnn. Website: epfml.github.io/attention-cnn. |
| Open Datasets | Yes | We compare it to the standard ResNet18 (He et al., 2015) on the CIFAR-10 dataset (Krizhevsky et al.). |
| Dataset Splits | No | The paper mentions training and testing on CIFAR-10 and presents test accuracy, but it does not provide specific details about the training, validation, and test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. While CIFAR-10 has standard splits, the paper does not explicitly state them. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running the experiments. It mentions software like PyTorch, but no hardware. |
| Software Dependencies | No | We used the PyTorch library (Paszke et al., 2017) and based our implementation on PyTorch Transformers. While PyTorch and PyTorch Transformers are mentioned, specific version numbers for these software components are not provided, which would be needed for a reproducible description of the ancillary software. |
| Experiment Setup | Yes | Hyper-parameters: number of layers: 6; number of heads: 9; hidden dimension: 400; intermediate dimension: 512; invertible pooling width: 2; dropout probability: 0.1; layer normalization epsilon: 10^-12; number of epochs: 300; batch size: 100; learning rate: 0.1; weight decay: 0.0001; momentum: 0.9; cosine decay; linear warm-up ratio: 0.05. (A sketch of this training configuration in code follows the table.) |
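
The expressivity claim quoted in the Research Type row can be illustrated concretely. The snippet below is a minimal numerical sketch (ours, not the authors' released code) of the construction behind the theorem: each of the K × K heads of a multi-head self-attention layer attends, with hard one-hot attention, to exactly one pixel of the K × K neighbourhood, and the concatenation of the heads followed by the output projection coincides with an ordinary K × K convolution. All tensor names, shapes, and the random weights are illustrative assumptions.

```python
# Minimal numerical sketch (not the authors' code): a multi-head self-attention
# layer with K*K heads, each attending to one pixel of the K x K neighbourhood,
# reproduces a K x K convolution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C_in, C_out, H, W, K = 1, 3, 4, 8, 8, 3            # batch, channels, spatial size, kernel
shifts = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # one relative shift per head
d_head = C_out                                         # simplest choice for the head dimension

x = torch.randn(B, C_in, H, W)
W_v = torch.randn(len(shifts), d_head, C_in)           # per-head value projections W_v^(h)
W_out = torch.randn(C_out, len(shifts) * d_head)       # shared output projection

# "Attention" with hard one-hot probabilities: head h simply reads the value at
# position (i+dy, j+dx), with zero padding at the borders.
x_pad = F.pad(x, (1, 1, 1, 1))
head_outputs = []
for h, (dy, dx) in enumerate(shifts):
    shifted = x_pad[:, :, 1 + dy:1 + dy + H, 1 + dx:1 + dx + W]      # (B, C_in, H, W)
    head_outputs.append(torch.einsum('dc,bchw->bdhw', W_v[h], shifted))
concat = torch.cat(head_outputs, dim=1)                # (B, N_h * d_head, H, W)
mhsa_out = torch.einsum('od,bdhw->bohw', W_out, concat)

# The equivalent convolution kernel: kernel[:, :, 1+dy, 1+dx] = W_out^(h) @ W_v^(h).
kernel = torch.zeros(C_out, C_in, K, K)
W_out_heads = W_out.view(C_out, len(shifts), d_head)
for h, (dy, dx) in enumerate(shifts):
    kernel[:, :, 1 + dy, 1 + dx] = W_out_heads[:, h] @ W_v[h]
conv_out = F.conv2d(x, kernel, padding=1)

print(torch.allclose(mhsa_out, conv_out, atol=1e-5))   # expected: True
```

Running the sketch prints `True`: the hand-built attention layer and `F.conv2d` produce the same output. The authors' released code at github.com/epfml/attention-cnn implements the learned counterpart of this construction with relative positional encodings.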
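
The Experiment Setup row above gives the optimization hyper-parameters: learning rate 0.1, momentum 0.9, weight decay 0.0001, batch size 100, 300 epochs, cosine decay with a linear warm-up ratio of 0.05. The sketch below shows one way these values could be wired together in PyTorch; the use of torch.optim.SGD (inferred from the momentum value), the helper name build_optimizer_and_scheduler, the scheduler composition, and steps_per_epoch = 500 (50,000 CIFAR-10 training images at batch size 100) are our assumptions, not the authors' training script.

```python
# Hedged sketch of the training configuration from the Experiment Setup row.
# Hyper-parameter values come from the paper; SGD, the scheduler composition
# (linear warm-up for 5% of steps, then cosine decay to zero) and all names
# below are assumptions, not the authors' released training script.
import math
import torch

def build_optimizer_and_scheduler(model, epochs=300, steps_per_epoch=500,
                                  lr=0.1, momentum=0.9, weight_decay=1e-4,
                                  warmup_ratio=0.05):
    total_steps = epochs * steps_per_epoch       # 500 steps/epoch = 50,000 images / batch size 100
    warmup_steps = int(warmup_ratio * total_steps)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=momentum, weight_decay=weight_decay)

    def lr_lambda(step):
        # Linear warm-up to the base learning rate, then cosine decay to zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

The scheduler is stepped once per optimization step (scheduler.step() after optimizer.step()), so the warm-up covers the first 5% of all parameter updates.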