Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

Authors: Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
Researcher Affiliation | Academia | Sriram Balasubramanian, Department of Computer Science, University of Maryland, College Park (sriramb@cs.umd.edu); Samyadeep Basu, Department of Computer Science, University of Maryland, College Park (sbasu12@umd.edu); Soheil Feizi, Department of Computer Science, University of Maryland, College Park (sfeizi@cs.umd.edu)
Pseudocode | Yes | Algorithm 1: REPDECOMPOSE [see the illustrative sketch below the table]
Open Source Code | Yes | We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
Open Datasets | Yes | We train these linear maps with regularizations so that these maps preserve the roles of the individual components while also aligning the model's image representation with CLIP's image representation. This allows us to map each contribution vector from any component to CLIP space, where they can be interpreted through text using a CLIP text encoder. We obtain z_CLIP from the CLIP image encoder and {c_i}_{i=1}^N from running REPDECOMPOSE on the final representation of the model. We can now use our framework to identify components which can retrieve images possessing a certain feature most effectively. Using the scoring function described above, we can identify the top-k components {c_i}_{i=1}^k which are the most responsive to a given feature p. We can use the cosine similarity of Σ_{i=1}^k f_i(c_i) to the CLIP embedding of an instantiation s_p of the feature p to retrieve the closest matches in the ImageNet-1k validation split. We then embed each of these texts to CLIP space, obtaining a set of embeddings B. We also calculate the CLIP-aligned contributions f_i(c_i) for each component i over an image dataset (ImageNet-1k validation split). [see the retrieval sketch below the table]
Dataset Splits | Yes | The aligners are trained with learning rate = 3 × 10^-4, λ = 1/768 using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split.
Hardware Specification | Yes | The bulk of computation is utilized to compute component contributions and train the aligner. Most of the experiments in the paper were conducted on a single RTX A5000 GPU, with 32GB CPU memory and 4 compute nodes.
Software Dependencies | No | We use the following models from Huggingface's timm [35] repository: (i) DeiT (ViT-B-16) [32], (ii) DINO (ViT-B-16) [7], (iii) DINOv2 (ViT-B-14) [22], (iv) Swin Base (patch size = 4, window size = 7) [17], (v) MaxViT Small [33], along with (vi) CLIP (ViT-B-16) [9] from open_clip [13]. The aligners are trained with learning rate = 3 × 10^-4, λ = 1/768 using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. [see the model-loading sketch below the table]
Experiment Setup | Yes | The aligners are trained with learning rate = 3 × 10^-4, λ = 1/768 using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. Hyperparameters were loosely tuned for the DeiT model using the cosine similarity as a metric, and then fixed for the rest of the models. [see the training sketch below the table]
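
The sketches below illustrate, in hedged form, the setup details quoted in the table; none of them reproduce the paper's released code. First, loading the backbones listed under Software Dependencies. The timm model identifier strings are assumptions and may not match the checkpoints used in the released code; only the `timm.create_model` and `open_clip.create_model_and_transforms` / `open_clip.get_tokenizer` calls themselves are standard library usage.

```python
import timm
import open_clip

# (i)-(v): ViT variants from timm. The identifier strings below are best guesses,
# not verified against the released code.
backbones = {
    "deit":   "deit_base_patch16_224",
    "dino":   "vit_base_patch16_224.dino",
    "dinov2": "vit_base_patch14_dinov2.lvd142m",
    "swin":   "swin_base_patch4_window7_224",
    "maxvit": "maxvit_small_tf_224",
}
models = {name: timm.create_model(arch, pretrained=True).eval()
          for name, arch in backbones.items()}

# (vi): CLIP ViT-B-16 from open_clip, used as the shared alignment target.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
```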
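Second, the decomposition itself. Algorithm 1 (REPDECOMPOSE) is not reproduced on this page; the toy model below only illustrates the bookkeeping the paper's decomposition relies on, namely that the final representation can be written as a sum of per-component contributions c_i. Real ViT components (attention heads, MLPs, normalization layers) need the handling described in the paper, which this sketch omits, and the class and variable names here are purely illustrative.

```python
import torch
import torch.nn as nn

class ToyResidualStack(nn.Module):
    """Toy residual stack: each block's residual-branch output is one contribution."""
    def __init__(self, dim=768, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        contributions = [x.clone()]      # c_0: the input embedding itself
        for block in self.blocks:
            delta = block(x)             # residual-branch output = component contribution
            contributions.append(delta)
            x = x + delta                # residual update
        return x, contributions

model = ToyResidualStack()
x = torch.randn(1, 768)
z, cs = model(x)
# Decomposition property: the final representation equals the sum of contributions.
assert torch.allclose(z, torch.stack(cs).sum(dim=0), atol=1e-5)
```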
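Third, a sketch of the aligner training quoted under Dataset Splits and Experiment Setup. The quoted text fixes only the optimizer (Adam), the learning rate (3 × 10^-4), λ = 1/768, and training for up to one epoch on the ImageNet-1k validation split. The loss form below (cosine alignment of the summed mapped contributions to the CLIP embedding, plus a λ-weighted penalty pulling each linear map toward a shared map) is an assumed reading of the quoted phrase about preserving component roles, not the paper's exact objective, and the component count and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_COMPONENTS, DIM, CLIP_DIM = 150, 768, 512   # hypothetical sizes
aligners = nn.ModuleList([nn.Linear(DIM, CLIP_DIM, bias=False)
                          for _ in range(N_COMPONENTS)])
optimizer = torch.optim.Adam(aligners.parameters(), lr=3e-4)
lam = 1.0 / 768

def training_step(contributions, z_clip):
    # contributions: (N_COMPONENTS, batch, DIM) from the decomposition;
    # z_clip: (batch, CLIP_DIM) from the CLIP image encoder.
    mapped = torch.stack([f(c) for f, c in zip(aligners, contributions)])
    z_aligned = mapped.sum(dim=0)                                  # (batch, CLIP_DIM)
    align_loss = 1.0 - F.cosine_similarity(z_aligned, z_clip, dim=-1).mean()
    # Assumed regularizer: keep each per-component map close to a shared map.
    shared = torch.stack([f.weight for f in aligners]).mean(dim=0)
    reg = sum((f.weight - shared).pow(2).sum() for f in aligners)
    loss = align_loss + lam * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```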
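Finally, the retrieval step quoted under Open Datasets: sum the CLIP-aligned contributions f_i(c_i) of the k components most responsive to a feature, then rank ImageNet-1k validation images by cosine similarity to the CLIP text embedding of an instantiation s_p of that feature. The function and argument names are hypothetical, and the component-scoring step that produces `top_k_idx` is not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(aligned_contribs, top_k_idx, clip_model, tokenizer, text_prompt, n_images=10):
    """aligned_contribs: (n_components, n_val_images, clip_dim) tensor of f_i(c_i)
    over the ImageNet-1k validation split; top_k_idx: indices of the k components
    most responsive to the feature of interest."""
    # Sum the aligned contributions of the selected components for every image.
    summed = aligned_contribs[top_k_idx].sum(dim=0)        # (n_val_images, clip_dim)
    # Embed the textual instantiation s_p of the feature with the CLIP text encoder.
    tokens = tokenizer([text_prompt])
    text_emb = clip_model.encode_text(tokens)              # (1, clip_dim)
    # Rank images by cosine similarity and return the closest matches.
    sims = F.cosine_similarity(summed, text_emb, dim=-1)   # (n_val_images,)
    return sims.topk(n_images).indices
```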