Vision Transformers Need Registers

Authors: Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. ... We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
Researcher Affiliation | Collaboration | 1 FAIR, Meta; 2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the official repositories of existing methods (DeiT-III, OpenCLIP, DINOv2) that the authors used and modified, but does not state that their specific changes or their complete codebase are made open source.
Open Datasets | Yes | We run this method on the ImageNet-22k dataset... We run the OpenCLIP method on a text-image-aligned corpus based on Shutterstock... We run this method on ImageNet-22k with the ViT-L configuration. (Russakovsky et al., 2015)
Dataset Splits | No | The paper mentions 'linear probing on ImageNet classification, ADE20k segmentation, and NYUd monocular depth estimation. We follow the experimental protocol outlined in Oquab et al. (2023).' and evaluation on 'PASCAL VOC 2007 and 2012 and COCO 20k', but does not specify explicit training, validation, or test split percentages or counts for its own experiments.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instances).
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., PyTorch version, Python version, CUDA version).
Experiment Setup | Yes | We run this method on the ImageNet-22k dataset, using the ViT-B settings... We run the OpenCLIP method... We use a ViT-B/16 image encoder... We run this method on ImageNet-22k with the ViT-L configuration... We train DINOv2 ViT-L/14 models with 0, 1, 2, 4, 8 or 16 registers. In all our experiments, we kept 4 register tokens.
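Since the paper releases no code and contains no pseudocode (see the Open Source Code and Pseudocode rows above), the following PyTorch sketch only illustrates the kind of setup quoted in the Experiment Setup row: a ViT backbone with a handful of extra learnable register tokens appended to the input sequence and discarded at the output. The class name, dimensions, depth, and the exact placement of the registers are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ViTWithRegisters(nn.Module):
    """Minimal sketch of a ViT with learnable register tokens.

    Hypothetical illustration of the setup described in the paper
    ("DINOv2 ViT-L/14 models with 0, 1, 2, 4, 8 or 16 registers");
    the actual DINOv2 / DeiT-III / OpenCLIP training code is not shown here.
    """

    def __init__(self, embed_dim=1024, num_registers=4, num_patches=256,
                 depth=4, num_heads=16):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Extra learnable tokens: used during the forward pass, dropped at the end.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        # patch_tokens: (B, num_patches, embed_dim), e.g. from a patch-embedding conv.
        b = patch_tokens.shape[0]
        x = torch.cat([self.cls_token.expand(b, -1, -1), patch_tokens], dim=1)
        x = x + self.pos_embed
        # Append registers after the positional embeddings (they carry no position).
        x = torch.cat([x, self.registers.expand(b, -1, -1)], dim=1)
        x = self.blocks(x)
        # Discard the register outputs: keep only [CLS] + patch tokens.
        return x[:, : x.shape[1] - self.num_registers]
```

With the defaults above, `ViTWithRegisters()(torch.randn(2, 256, 1024))` returns a (2, 257, 1024) tensor: the [CLS] and patch token outputs, with the 4 register outputs dropped before any downstream use.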