ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
Authors: Stéphane D’Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, Levent Sagun
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then perform experiments based on the DeiT (Touvron et al., 2020), with a certain number of SA layers replaced by GPSA layers. The resulting Convolutional Vision Transformer (ConViT) outperforms the DeiT while boasting a much improved sample-efficiency (Fig. 2). |
| Researcher Affiliation | Collaboration | ¹Department of Physics, École Normale Supérieure, Paris, France; ²Facebook AI Research, Paris, France. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are released publicly at https://github.com/facebookresearch/convit. [...] We provide an open-source implementation of our method as well as pretrained models at the following address: https://github.com/facebookresearch/convit. |
| Open Datasets | Yes | The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al., 2020) on ImageNet, while offering a much improved sample efficiency. |
| Dataset Splits | Yes | We compare the sample efficiency of our ConViT-S (see Tab. 1) with that of the DeiT-S by training them on restricted portions of ImageNet-1k, where we only keep a certain fraction of the images of each class. |
| Hardware Specification | Yes | Speed is the number of images processed per second on a Nvidia Quadro GP100 GPU at batch size 128. |
| Software Dependencies | No | The paper mentions basing the work on DeiT and using certain hyperparameters but does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | To maintain stable training while fitting these models on 8 GPUs, we lowered the learning rate from 0.0005 to 0.0004 and the batch size from 1024 to 512. |
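
The GPSA layer quoted in the Research Type row is the paper's central mechanism: each head mixes ordinary content attention with a learned positional attention map through a per-head sigmoid gate. Below is a minimal PyTorch sketch of that idea; the module layout, the three-feature relative-position encoding, and the neutral gating initialization are illustrative assumptions, not the released facebookresearch/convit implementation (the paper additionally initializes the positional branch to mimic convolutional kernels, which gives the "soft convolutional inductive bias").

```python
import torch
import torch.nn as nn

class GPSA(nn.Module):
    """Sketch of Gated Positional Self-Attention (ConViT-style).

    Each head outputs a convex mixture of content attention,
    softmax(QK^T / sqrt(d)), and positional attention computed from
    relative patch coordinates, weighted by sigmoid(lambda_h).
    """
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qk = nn.Linear(dim, dim * 2, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        # project (dx, dy, dx^2 + dy^2) to one positional score per head
        self.pos_proj = nn.Linear(3, num_heads)
        # gating parameter lambda_h; zeros give a neutral 0.5/0.5 mix here
        # (assumption: the paper instead initializes toward the positional branch)
        self.gating = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x, rel_pos):
        # x: (B, N, dim); rel_pos: (N, N, 3) relative patch coordinates
        B, N, _ = x.shape
        q, k = self.qk(x).chunk(2, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(x).view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        content = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B,H,N,N)
        position = self.pos_proj(rel_pos).permute(2, 0, 1)          # (H,N,N)

        gate = torch.sigmoid(self.gating).view(1, -1, 1, 1)
        attn = (1 - gate) * content.softmax(dim=-1) + gate * position.softmax(dim=-1)
        attn = attn / attn.sum(dim=-1, keepdim=True)  # keep rows normalized

        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)

def relative_positions(grid):
    """Relative (dx, dy, dx^2+dy^2) features for a grid x grid patch layout."""
    ij = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack(ij, dim=-1).reshape(-1, 2).float()
    d = coords[:, None, :] - coords[None, :, :]                        # (N, N, 2)
    return torch.cat([d, (d ** 2).sum(-1, keepdim=True)], dim=-1)     # (N, N, 3)

# Example: 14x14 = 196 patches of dimension 384, roughly ConViT-S scale.
x = torch.randn(2, 14 * 14, 384)
y = GPSA(384, num_heads=6)(x, relative_positions(14))  # (2, 196, 384)
```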
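The Dataset Splits row quotes the sample-efficiency protocol: train on restricted portions of ImageNet-1k, keeping only a fixed fraction of the images of each class. A hypothetical helper illustrating that per-class subsampling; the `(path, class_id)` layout and function name are assumptions, not the paper's tooling.

```python
import random
from collections import defaultdict

def subsample_per_class(samples, fraction, seed=0):
    """Keep `fraction` of the images of each class, ConViT-style.

    samples: list of (path, class_id) pairs (illustrative layout).
    A fixed seed keeps the restricted subset identical across runs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append((path, cls))
    kept = []
    for items in by_class.values():
        rng.shuffle(items)
        # keep at least one image per class so no class disappears
        kept.extend(items[: max(1, int(len(items) * fraction))])
    return kept
```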
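The Experiment Setup row records the quoted hyperparameter change: learning rate lowered from 0.0005 to 0.0004 and batch size from 1024 to 512 to fit the models on 8 GPUs. A small sketch of that configuration; the key names mimic common DeiT/timm launch arguments and are assumptions, as is applying DeiT's linear learning-rate scaling convention here.

```python
# Quoted adjustment for fitting the larger ConViT models on 8 GPUs.
train_args = {
    "lr": 4e-4,         # lowered from the DeiT default of 5e-4
    "batch_size": 512,  # halved from 1024
    "world_size": 8,    # 8 GPUs, as quoted
}

# DeiT-style recipes scale the learning rate linearly with total batch
# size relative to a reference of 512 (assumption that it applies here):
scaled_lr = train_args["lr"] * train_args["batch_size"] / 512
```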