Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
Authors: Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our framework to various ViT variants (e.g., DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose |
| Researcher Affiliation | Academia | Sriram Balasubramanian, Department of Computer Science, University of Maryland, College Park (sriramb@cs.umd.edu); Samyadeep Basu, Department of Computer Science, University of Maryland, College Park (sbasu12@umd.edu); Soheil Feizi, Department of Computer Science, University of Maryland, College Park (sfeizi@cs.umd.edu) |
| Pseudocode | Yes | Algorithm 1 REPDECOMPOSE |
| Open Source Code | Yes | We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose |
| Open Datasets | Yes | We train these linear maps with regularizations so that these maps preserve the roles of the individual components while also aligning the model's image representation with CLIP's image representation. This allows us to map each contribution vector from any component to CLIP space, where they can be interpreted through text using a CLIP text encoder. We obtain $z_{\text{CLIP}}$ from the CLIP image encoder and $\{c_i\}_{i=1}^N$ from running REPDECOMPOSE on the final representation of the model. We can now use our framework to identify components which can retrieve images possessing a certain feature most effectively. Using the scoring function described above, we can identify the top $k$ components $\{c_i\}_{i=1}^k$ which are the most responsive to a given feature $p$. We can use the cosine similarity of $\sum_{i=1}^k f_i(c_i)$ to the CLIP embedding of an instantiation $s_p$ of the feature $p$ to retrieve the closest matches in the ImageNet-1k validation split. We then embed each of these texts to CLIP space, obtaining a set of embeddings $B$. We also calculate the CLIP-aligned contributions $f_i(c_i)$ for each component $i$ over an image dataset (ImageNet-1k validation split). (See the retrieval sketch after the table.) |
| Dataset Splits | Yes | The aligners are trained with learning rate $= 3 \times 10^{-4}$, $\lambda = 1/768$ using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. |
| Hardware Specification | Yes | The bulk of computation is utilized to compute component contributions and train the aligner. Most of the experiments in the paper were conducted on a single RTX A5000 GPU, with 32GB CPU memory and 4 compute nodes. |
| Software Dependencies | No | We use the following models from Huggingface's timm [35] repository: (i) DeiT (ViT-B-16) [32], (ii) DINO (ViT-B-16) [7], (iii) DINOv2 (ViT-B-14) [22], (iv) Swin Base (patch size = 4, window size = 7) [17], (v) MaxViT Small [33], along with (vi) CLIP (ViT-B-16) [9] from open_clip [13]. The aligners are trained with learning rate $= 3 \times 10^{-4}$, $\lambda = 1/768$ using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. (See the model-loading sketch after the table.) |
| Experiment Setup | Yes | The aligners are trained with learning rate $= 3 \times 10^{-4}$, $\lambda = 1/768$ using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. Hyperparameters were loosely tuned for the DeiT model using cosine similarity as a metric, and then fixed for the rest of the models. (See the aligner-training sketch after the table.) |
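
The software-dependency row lists the backbones only by name. Below is a minimal sketch of how they could be instantiated from timm and open_clip; the exact timm model identifiers are assumptions on our part, since the excerpt does not quote them.

```python
import timm
import open_clip

# Assumed timm identifiers for the architectures named in the paper
# (DeiT ViT-B-16, DINO ViT-B-16, DINOv2 ViT-B-14, Swin Base, MaxViT Small).
backbones = {
    "deit":   "deit_base_patch16_224",
    "dino":   "vit_base_patch16_224.dino",
    "dinov2": "vit_base_patch14_dinov2",
    "swin":   "swin_base_patch4_window7_224",
    "maxvit": "maxvit_small_tf_224",
}

# Instantiate each backbone with pretrained weights and put it in eval mode.
models = {name: timm.create_model(ident, pretrained=True).eval()
          for name, ident in backbones.items()}

# CLIP ViT-B-16 from open_clip provides the shared image/text space
# used to interpret the component contributions.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai"
)
clip_tokenizer = open_clip.get_tokenizer("ViT-B-16")
```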
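
The dataset-splits and experiment-setup rows quote the aligner hyperparameters (Adam, learning rate $3 \times 10^{-4}$, $\lambda = 1/768$, up to one epoch on the ImageNet validation split) but not the loss itself. The sketch below assumes a cosine-alignment term between the summed mapped contributions and the CLIP image embedding plus a $\lambda$-weighted L2 penalty on the linear maps; the actual regularizer used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def train_aligners(aligners, loader, lr=3e-4, lam=1.0 / 768):
    """aligners: list of nn.Linear maps f_i, one per component contribution c_i.
    loader yields (contribs, z_clip): per-image component contributions and the
    CLIP image embedding of the same image."""
    params = [p for f in aligners for p in f.parameters()]
    opt = torch.optim.Adam(params, lr=lr)          # Adam with defaults otherwise
    for contribs, z_clip in loader:                # a single pass = one epoch
        mapped = [f(c) for f, c in zip(aligners, contribs)]
        z_aligned = torch.stack(mapped).sum(dim=0)                  # sum_i f_i(c_i)
        align = 1 - F.cosine_similarity(z_aligned, z_clip, dim=-1).mean()
        reg = sum((f.weight ** 2).sum() for f in aligners)          # assumed L2 penalty
        loss = align + lam * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return aligners
```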
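
The open-datasets row describes retrieval as ranking ImageNet-1k validation images by the cosine similarity between $\sum f_i(c_i)$ over the top-$k$ components and the CLIP text embedding of a feature description $s_p$. A hedged sketch of that scoring step follows; the variable names (`aligned_contribs`, `top_k_idx`) are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(aligned_contribs, top_k_idx, clip_model, clip_tokenizer, s_p, k=10):
    """aligned_contribs: tensor [num_images, num_components, d] of f_i(c_i)
    precomputed over the ImageNet-1k validation split.
    top_k_idx: indices of the components most responsive to the feature p."""
    text = clip_tokenizer([s_p])
    t = F.normalize(clip_model.encode_text(text), dim=-1)  # CLIP embedding of s_p
    z = aligned_contribs[:, top_k_idx, :].sum(dim=1)       # sum over top-k f_i(c_i)
    scores = F.normalize(z, dim=-1) @ t.squeeze(0)         # cosine similarity per image
    return scores.topk(k).indices                          # indices of closest matches
```

For example, `retrieve(aligned_contribs, top_k_idx, clip_model, clip_tokenizer, "a photo of a striped texture")` would return the indices of the ten validation images closest to that description under this score.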