On the Relationship between Self-Attention and Convolutional Layers

Authors: Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. (This construction is illustrated in a code sketch after the table.)
Researcher Affiliation | Academia | Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi, École Polytechnique Fédérale de Lausanne (EPFL), {first.last}@epfl.ch
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The proof sections contain mathematical derivations.
Open Source Code | Yes | Our code is publicly available. Code: github.com/epfml/attention-cnn. Website: epfml.github.io/attention-cnn.
Open Datasets | Yes | We compare it to the standard ResNet18 (He et al., 2015) on the CIFAR-10 dataset (Krizhevsky et al.).
Dataset Splits | No | The paper mentions training and testing on CIFAR-10 and reports test accuracy, but it does not provide specific details about the training, validation, and test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. While CIFAR-10 has standard splits, the paper does not state them explicitly. (The standard partition is shown in a code sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running the experiments. It mentions software such as PyTorch, but no hardware.
Software Dependencies | No | We used the PyTorch library (Paszke et al., 2017) and based our implementation on PyTorch-Transformers. While PyTorch and PyTorch-Transformers are mentioned, specific version numbers for these components are not provided, which would be required for a reproducible description of the software environment.
Experiment Setup | Yes | Hyper-parameters: number of layers 6; number of heads 9; hidden dimension 400; intermediate dimension 512; invertible pooling width 2; dropout probability 0.1; layer normalization epsilon 1e-12; number of epochs 300; batch size 100; learning rate 0.1; weight decay 0.0001; momentum 0.9; learning-rate schedule cosine decay with linear warm-up ratio 0.05. (This configuration is collected in a code sketch after the table.)
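
The expressivity claim quoted under Research Type can be made concrete. Below is a minimal sketch, not the authors' implementation: it emulates the paper's construction in which each of K*K attention heads attends with a one-hot pattern to one fixed relative pixel shift (obtained in the paper from quadratic relative positional encodings) and applies its own value projection, so that the heads sum to an ordinary KxK convolution.

```python
# Minimal sketch (not the authors' code): K*K attention "heads", each attending
# with a one-hot pattern to one fixed relative pixel shift and applying a
# per-head value projection, sum to an ordinary KxK convolution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_in, C_out, K, H, W = 3, 8, 3, 16, 16           # channels, kernel size, image size
x = torch.randn(1, C_in, H, W)
weight = torch.randn(C_out, C_in, K, K)          # convolution kernel to reproduce

# Reference: a standard convolution with "same" padding and no bias.
ref = F.conv2d(x, weight, padding=K // 2)

# Attention-style construction: one head per kernel position (i, j).
out = torch.zeros(1, C_out, H, W)
x_pad = F.pad(x, (K // 2, K // 2, K // 2, K // 2))
for i in range(K):
    for j in range(K):
        # Head (i, j) "attends" to the pixel at relative shift (i - K//2, j - K//2);
        # the paper obtains this one-hot attention pattern from quadratic relative
        # positional encodings, while here it is emulated by slicing the padded input.
        attended = x_pad[:, :, i:i + H, j:j + W]              # (1, C_in, H, W)
        # Per-head value projection: the 1x1 linear map weight[:, :, i, j].
        out += torch.einsum("oc,bchw->bohw", weight[:, :, i, j], attended)

print(torch.allclose(out, ref, atol=1e-5))       # True: the heads sum to the conv
```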
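
For the Dataset Splits row: CIFAR-10 ships with a fixed 50,000-image training set and 10,000-image test set. The sketch below uses torchvision to load that default partition; this is an assumption about tooling, since the paper does not describe its data loading, and it does not settle whether a validation subset was held out.

```python
# Sketch of the standard CIFAR-10 partition (50,000 train / 10,000 test images)
# via torchvision; the paper does not state its data-loading code, so this is
# illustrative only.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)
print(len(train_set), len(test_set))  # 50000 10000
```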
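
For the Experiment Setup row: a sketch collecting the listed hyper-parameters into a training configuration. The SGD optimizer and the exact warm-up/cosine schedule formula are assumptions inferred from the reported momentum, weight decay, and schedule description, and `toy_model` is a placeholder rather than the paper's architecture.

```python
# Sketch of the reported hyper-parameters as a training configuration.
# Assumptions: SGD optimizer and this particular warm-up/cosine formula are
# inferred from the listed momentum, weight decay, and schedule; `toy_model`
# is a stand-in for the actual network.
import math
import torch

hparams = dict(
    num_layers=6, num_heads=9, hidden_dim=400, intermediate_dim=512,
    invertible_pooling_width=2, dropout=0.1, layer_norm_eps=1e-12,
    epochs=300, batch_size=100, lr=0.1, weight_decay=1e-4, momentum=0.9,
    warmup_ratio=0.05,
)

toy_model = torch.nn.Linear(hparams["hidden_dim"], 10)   # placeholder model
optimizer = torch.optim.SGD(
    toy_model.parameters(), lr=hparams["lr"],
    momentum=hparams["momentum"], weight_decay=hparams["weight_decay"],
)

steps_per_epoch = 50_000 // hparams["batch_size"]        # CIFAR-10 training-set size
total_steps = hparams["epochs"] * steps_per_epoch
warmup_steps = int(hparams["warmup_ratio"] * total_steps)

def lr_lambda(step):
    # Linear warm-up for the first 5% of steps, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```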