XCiT: Cross-Covariance Image Transformers

Authors: Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including (self-supervised) image classification on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
Researcher Affiliation | Collaboration | Facebook AI, Inria, Sorbonne University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Code: https://github.com/facebookresearch/xcit
Open Datasets | Yes | We use ImageNet-1k [19] to train and evaluate our models for image classification. It consists of 1.28M training images and 50k validation images, labeled across 1,000 semantic categories.
Dataset Splits | Yes | We use ImageNet-1k [19] to train and evaluate our models for image classification. It consists of 1.28M training images and 50k validation images, labeled across 1,000 semantic categories.
Hardware Specification | Yes | All measurements are performed with a batch size of 64 on a single V100-32GB GPU.
Software Dependencies | No | Our implementation is based on the Timm library [72]. Our implementation is based on the mmdetection library [13]. Our implementation is based on the mmsegmentation library [16]. No specific version numbers are provided for these libraries.
Experiment Setup | Yes | We train our model for 400 epochs with the AdamW optimizer [45] using a cosine learning rate decay. In order to enhance the training of larger models, we utilize LayerScale [67] and adjust the stochastic depth [33] for each of our models accordingly (see the supplementary material for details). The model is trained for 36 epochs (3x schedule) using the AdamW optimizer with a learning rate of 10⁻⁴, 0.05 weight decay, and a batch size of 16. We train for 80k and 160k iterations for Semantic FPN and UperNet respectively. Following [44], the models are trained using batch size 16 and an AdamW optimizer with a learning rate of 6×10⁻⁵ and 0.01 weight decay.
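
The experiment-setup evidence above amounts to a conventional training recipe for the ImageNet-1k classifier: AdamW with cosine learning-rate decay over 400 epochs. The sketch below illustrates that schedule in plain PyTorch; it is a minimal illustration under stated assumptions, not the authors' released training script, and `build_xcit_model`, `build_imagenet_loader`, the base learning rate, and the weight decay are placeholders or assumed values.

```python
# Minimal sketch of the quoted classification schedule: AdamW + cosine decay, 400 epochs.
# Not the official https://github.com/facebookresearch/xcit code; model/loader are placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 400          # classification schedule quoted in the table
BASE_LR = 5e-4        # assumed base learning rate (not stated in the quoted text)
WEIGHT_DECAY = 0.05   # assumed; 0.05 is quoted only for the detection setup

def train(model, loader, device="cuda"):
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
    scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine learning-rate decay
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(EPOCHS):
        model.train()
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()  # one cosine step per epoch
```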
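
The hardware row likewise states that all speed measurements use a batch size of 64 on a single V100-32GB GPU. The sketch below shows one plausible way to take such a throughput measurement in PyTorch; it is an assumption about the measurement procedure, not the authors' benchmarking code.

```python
# Hypothetical throughput measurement at batch size 64 on a single GPU,
# mirroring the setting quoted in the hardware row. Not the authors' script.
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=64, image_size=224, iters=50, warmup=10):
    device = torch.device("cuda")
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(warmup):      # warm-up iterations, excluded from timing
        model(images)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(images)
    torch.cuda.synchronize()     # wait for all GPU kernels before stopping the clock
    elapsed = time.time() - start
    return batch_size * iters / elapsed  # images per second
```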