On the Relationship between Self-Attention and Convolutional Layers

Authors: Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. (This construction is illustrated in a code sketch after the table.)
Researcher Affiliation | Academia | Jean-Baptiste Cordonnier, Andreas Loukas & Martin Jaggi, École Polytechnique Fédérale de Lausanne (EPFL), {first.last}@epfl.ch
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The proof sections contain mathematical derivations.
Open Source Code | Yes | Our code is publicly available. Code: github.com/epfml/attention-cnn. Website: epfml.github.io/attention-cnn.
Open Datasets | Yes | We compare it to the standard ResNet18 (He et al., 2015) on the CIFAR-10 dataset (Krizhevsky et al.).
Dataset Splits | No | The paper mentions training and testing on CIFAR-10 and reports test accuracy, but it does not provide specific details about the training, validation, and test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning. While CIFAR-10 has standard splits, the paper does not state them explicitly. (The standard partition is shown in a code sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory amounts) used for running the experiments. It mentions software such as PyTorch, but no hardware.
Software Dependencies | No | We used the PyTorch library (Paszke et al., 2017) and based our implementation on PyTorch-Transformers. While PyTorch and PyTorch-Transformers are mentioned, specific version numbers for these components are not provided, which would be required for a reproducible description of the software environment.
Experiment Setup | Yes | Hyper-parameters: number of layers 6; number of heads 9; hidden dimension 400; intermediate dimension 512; invertible pooling width 2; dropout probability 0.1; layer normalization epsilon 1e-12; number of epochs 300; batch size 100; learning rate 0.1; weight decay 0.0001; momentum 0.9; learning-rate schedule cosine decay with linear warm-up ratio 0.05. (This configuration is collected in a code sketch after the table.)
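
The expressivity claim quoted under Research Type can be made concrete. Below is a minimal sketch, not the authors' implementation: it emulates the paper's construction in which each of K*K attention heads attends with a one-hot pattern to one fixed relative pixel shift (obtained in the paper from quadratic relative positional encodings) and applies its own value projection, so that the heads sum to an ordinary KxK convolution.

```python
# Minimal sketch (not the authors' code): K*K attention "heads", each attending
# with a one-hot pattern to one fixed relative pixel shift and applying a
# per-head value projection, sum to an ordinary KxK convolution.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C_in, C_out, K, H, W = 3, 8, 3, 16, 16           # channels, kernel size, image size
x = torch.randn(1, C_in, H, W)
weight = torch.randn(C_out, C_in, K, K)          # convolution kernel to reproduce

# Reference: a standard convolution with "same" padding and no bias.
ref = F.conv2d(x, weight, padding=K // 2)

# Attention-style construction: one head per kernel position (i, j).
out = torch.zeros(1, C_out, H, W)
x_pad = F.pad(x, (K // 2, K // 2, K // 2, K // 2))
for i in range(K):
    for j in range(K):
        # Head (i, j) "attends" to the pixel at relative shift (i - K//2, j - K//2);
        # the paper obtains this one-hot attention pattern from quadratic relative
        # positional encodings, while here it is emulated by slicing the padded input.
        attended = x_pad[:, :, i:i + H, j:j + W]              # (1, C_in, H, W)
        # Per-head value projection: the 1x1 linear map weight[:, :, i, j].
        out += torch.einsum("oc,bchw->bohw", weight[:, :, i, j], attended)

print(torch.allclose(out, ref, atol=1e-5))       # True: the heads sum to the conv
```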
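
For the Dataset Splits row: CIFAR-10 ships with a fixed 50,000-image training set and 10,000-image test set. The sketch below uses torchvision to load that default partition; this is an assumption about tooling, since the paper does not describe its data loading, and it does not settle whether a validation subset was held out.

```python
# Sketch of the standard CIFAR-10 partition (50,000 train / 10,000 test images)
# via torchvision; the paper does not state its data-loading code, so this is
# illustrative only.
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=to_tensor)
print(len(train_set), len(test_set))  # 50000 10000
```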
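
For the Experiment Setup row: a sketch collecting the listed hyper-parameters into a training configuration. The SGD optimizer and the exact warm-up/cosine schedule formula are assumptions inferred from the reported momentum, weight decay, and schedule description, and `toy_model` is a placeholder rather than the paper's architecture.

```python
# Sketch of the reported hyper-parameters as a training configuration.
# Assumptions: SGD optimizer and this particular warm-up/cosine formula are
# inferred from the listed momentum, weight decay, and schedule; `toy_model`
# is a stand-in for the actual network.
import math
import torch

hparams = dict(
    num_layers=6, num_heads=9, hidden_dim=400, intermediate_dim=512,
    invertible_pooling_width=2, dropout=0.1, layer_norm_eps=1e-12,
    epochs=300, batch_size=100, lr=0.1, weight_decay=1e-4, momentum=0.9,
    warmup_ratio=0.05,
)

toy_model = torch.nn.Linear(hparams["hidden_dim"], 10)   # placeholder model
optimizer = torch.optim.SGD(
    toy_model.parameters(), lr=hparams["lr"],
    momentum=hparams["momentum"], weight_decay=hparams["weight_decay"],
)

steps_per_epoch = 50_000 // hparams["batch_size"]        # CIFAR-10 training-set size
total_steps = hparams["epochs"] * steps_per_epoch
warmup_steps = int(hparams["warmup_ratio"] * total_steps)

def lr_lambda(step):
    # Linear warm-up for the first 5% of steps, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```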