ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

Authors: Stéphane D’Ascoli, Hugo Touvron, Matthew L Leavitt, Ari S Morcos, Giulio Biroli, Levent Sagun

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We then perform experiments based on the DeiT (Touvron et al., 2020), with a certain number of SA layers replaced by GPSA layers. The resulting Convolutional Vision Transformer (ConViT) outperforms the DeiT while boasting a much improved sample-efficiency (Fig. 2). [see the GPSA sketch after the table]
Researcher Affiliation | Collaboration | 1) Department of Physics, École Normale Supérieure, Paris, France; 2) Facebook AI Research, Paris, France.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and models are released publicly at https://github.com/facebookresearch/convit. [...] We provide an open-source implementation of our method as well as pretrained models at the following address: https://github.com/facebookresearch/convit. [see the model-loading example after the table]
Open Datasets | Yes | The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al., 2020) on ImageNet, while offering a much improved sample efficiency.
Dataset Splits | Yes | We compare the sample efficiency of our ConViT-S (see Tab. 1) with that of the DeiT-S by training them on restricted portions of ImageNet-1k, where we only keep a certain fraction of the images of each class. [see the per-class subsampling sketch after the table]
Hardware Specification | Yes | Speed is the number of images processed per second on an NVIDIA Quadro GP100 GPU at batch size 128. [see the throughput-timing sketch after the table]
Software Dependencies | No | The paper mentions basing the work on DeiT and using certain hyperparameters, but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | To maintain stable training while fitting these models on 8 GPUs, we lowered the learning rate from 0.0005 to 0.0004 and the batch size from 1024 to 512.
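
The "Research Type" row refers to DeiT self-attention (SA) blocks being swapped for gated positional self-attention (GPSA). Below is a minimal single-head sketch of the gating idea only, assuming freely learned positional attention logits rather than the paper's relative-position parametrisation; the class name `GPSASketch` and all shapes are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn


class GPSASketch(nn.Module):
    """Illustrative gated positional self-attention (single head).

    Blends a content attention map with a positional attention map via a
    learnable gate, so the layer can interpolate between SA-like and
    convolution-like behaviour. Sketch only, not facebookresearch/convit.
    """

    def __init__(self, dim: int, num_patches: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.qk = nn.Linear(dim, 2 * dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Positional attention logits over patch pairs; the paper derives
        # them from relative patch positions, here they are free parameters.
        self.pos_logits = nn.Parameter(torch.zeros(num_patches, num_patches))
        # Gating scalar: sigmoid(gate) weights the positional term. The paper
        # initialises the gates so that the positional (convolution-like)
        # term is favoured at the start of training.
        self.gate = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim)
        q, k = self.qk(x).chunk(2, dim=-1)
        content = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        positional = self.pos_logits.softmax(dim=-1)
        g = torch.sigmoid(self.gate)
        attn = (1 - g) * content + g * positional
        return self.proj(attn @ self.v(x))
```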
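
Since the code and pretrained models are public, a checkpoint can be loaded for evaluation in a few lines. The sketch below assumes a timm release that registers the ConViT variants; the registry key `convit_small` comes from timm, not from the paper, and should be treated as an assumption.

```python
import timm
import torch

# Assumption: a timm version that includes the ConViT models; otherwise the
# checkpoints can be obtained from github.com/facebookresearch/convit directly.
model = timm.create_model("convit_small", pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # dummy 224x224 RGB input
print(logits.shape)  # expected: torch.Size([1, 1000]) for ImageNet-1k classes
```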
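
For the "Dataset Splits" row, the sample-efficiency runs keep only a fixed fraction of the images of each ImageNet-1k class. A hedged sketch of such per-class subsampling with torchvision's ImageFolder follows; the seeding and rounding choices, and the helper name `subsample_per_class`, are assumptions rather than the authors' exact procedure.

```python
import random
from collections import defaultdict

from torch.utils.data import Subset
from torchvision.datasets import ImageFolder


def subsample_per_class(dataset: ImageFolder, fraction: float, seed: int = 0) -> Subset:
    """Keep roughly `fraction` of the images of every class (illustrative)."""
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(dataset.samples):
        by_class[label].append(idx)

    rng = random.Random(seed)
    kept = []
    for indices in by_class.values():
        rng.shuffle(indices)
        kept.extend(indices[: max(1, int(len(indices) * fraction))])
    return Subset(dataset, kept)


# Example: keep 10% of each class (the path is a placeholder).
# train_subset = subsample_per_class(ImageFolder("/path/to/imagenet/train"), 0.10)
```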
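
The speed figures in the "Hardware Specification" row are images per second at batch size 128 on a single Quadro GP100. A rough way to measure comparable numbers is sketched below; the warm-up length and iteration count are arbitrary choices not specified in the paper.

```python
import time

import torch


@torch.no_grad()
def images_per_second(model: torch.nn.Module, batch_size: int = 128,
                      iters: int = 50, warmup: int = 10) -> float:
    """Crude throughput estimate (images/s) on the current CUDA device."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    for _ in range(warmup):          # warm-up: cuDNN autotuning, memory allocs
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for all queued kernels to finish
    return batch_size * iters / (time.time() - start)
```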