Vision Transformers Need Registers
Authors: Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. ...We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing. |
| Researcher Affiliation | Collaboration | 1 FAIR, Meta; 2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to the official repositories of existing methods (DeiT-III, OpenCLIP, DINOv2) that the authors used and modified, but does not state that their specific modifications or their complete codebase are released as open source. |
| Open Datasets | Yes | We run this method on the ImageNet-22k dataset...We run the OpenCLIP method on a text-image-aligned corpus based on Shutterstock...We run this method on ImageNet-22k with the ViT-L configuration. (Russakovsky et al., 2015) |
| Dataset Splits | No | The paper mentions 'linear probing on ImageNet classification, ADE20k segmentation, and NYUd monocular depth estimation. We follow the experimental protocol outlined in Oquab et al. (2023).' and evaluation on 'PASCAL VOC 2007 and 2012 and COCO 20k' but does not specify explicit training, validation, or test split percentages or counts for their own experiments. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instances). |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., PyTorch version, Python version, CUDA version). |
| Experiment Setup | Yes | We run this method on the ImageNet-22k dataset, using the ViT-B settings... We run the OpenCLIP method... We use a ViT-B/16 image encoder... We run this method on ImageNet-22k with the ViT-L configuration... We train DINOv2 ViT-L/14 models with 0, 1, 2, 4, 8 or 16 registers. In all our experiments, we kept 4 register tokens. |
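
The Experiment Setup row above refers to register tokens, the paper's central modification: extra learnable tokens appended to the ViT input sequence and discarded at readout. Below is a minimal illustrative sketch of that idea in PyTorch; it is not the authors' implementation, and the class name `ViTWithRegisters` and the hyperparameters shown are assumptions chosen to echo the ViT-L/14, 4-register configuration quoted above.

```python
# Minimal sketch (not the authors' code) of register tokens in a ViT:
# extra learnable tokens are appended to the input sequence, attend like any
# other token, and are discarded before features are read out.
import torch
import torch.nn as nn


class ViTWithRegisters(nn.Module):
    """Hypothetical wrapper illustrating register tokens (names are assumptions)."""

    def __init__(self, embed_dim=1024, num_heads=16, depth=2, num_registers=4):
        super().__init__()
        self.num_registers = num_registers
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable register tokens; the paper sweeps 0/1/2/4/8/16 and keeps 4.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        block = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D), already patch-embedded and position-encoded.
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([cls, reg, patch_tokens], dim=1)  # [CLS] + registers + patches
        x = self.blocks(x)
        # Registers are dropped at readout: only [CLS] and patch features remain.
        return x[:, 0], x[:, 1 + self.num_registers:]


if __name__ == "__main__":
    model = ViTWithRegisters()
    patches = torch.randn(2, 256, 1024)       # e.g. a 224x224 image at patch size 14
    cls_feat, patch_feats = model(patches)
    print(cls_feat.shape, patch_feats.shape)  # (2, 1024) (2, 256, 1024)
```

In this reading, the register count is just a hyperparameter of the input sequence (0, 1, 2, 4, 8 or 16 in the paper's ablation), with 4 registers kept in all other experiments, as quoted in the Experiment Setup row.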