Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

Authors: Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
Researcher Affiliation | Academia | Sriram Balasubramanian, Department of Computer Science, University of Maryland, College Park (sriramb@cs.umd.edu); Samyadeep Basu, Department of Computer Science, University of Maryland, College Park (sbasu12@umd.edu); Soheil Feizi, Department of Computer Science, University of Maryland, College Park (sfeizi@cs.umd.edu)
Pseudocode | Yes | Algorithm 1: REPDECOMPOSE [see the illustrative sketch below the table]
Open Source Code | Yes | We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
Open Datasets | Yes | We train these linear maps with regularizations so that these maps preserve the roles of the individual components while also aligning the model's image representation with CLIP's image representation. This allows us to map each contribution vector from any component to CLIP space, where they can be interpreted through text using a CLIP text encoder. We obtain z_CLIP from the CLIP image encoder and {c_i}_{i=1}^N from running REPDECOMPOSE on the final representation of the model. We can now use our framework to identify components which can retrieve images possessing a certain feature most effectively. Using the scoring function described above, we can identify the top-k components {c_i}_{i=1}^k which are the most responsive to a given feature p. We can use the cosine similarity of Σ_{i=1}^k f_i(c_i) to the CLIP embedding of an instantiation s_p of the feature p to retrieve the closest matches in the ImageNet-1k validation split. We then embed each of these texts to CLIP space, obtaining a set of embeddings B. We also calculate the CLIP-aligned contributions f_i(c_i) for each component i over an image dataset (ImageNet-1k validation split). [see the retrieval sketch below the table]
Dataset Splits | Yes | The aligners are trained with learning rate = 3 × 10^-4, λ = 1/768 using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split.
Hardware Specification | Yes | The bulk of computation is utilized to compute component contributions and train the aligner. Most of the experiments in the paper were conducted on a single RTX A5000 GPU, with 32GB CPU memory and 4 compute nodes.
Software Dependencies | No | We use the following models from Huggingface's timm [35] repository: (i) DeiT (ViT-B-16) [32], (ii) DINO (ViT-B-16) [7], (iii) DINOv2 (ViT-B-14) [22], (iv) Swin Base (patch size = 4, window size = 7) [17], (v) MaxViT Small [33], along with (vi) CLIP (ViT-B-16) [9] from open_clip [13]. The aligners are trained with learning rate = 3 × 10^-4, λ = 1/768 using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. [see the model-loading sketch below the table]
Experiment Setup | Yes | The aligners are trained with learning rate = 3 × 10^-4, λ = 1/768 using the Adam optimizer (with default values for everything else) for up to an epoch on the ImageNet validation split. Hyperparameters were loosely tuned for the DeiT model using the cosine similarity as a metric, and then fixed for the rest of the models. [see the training sketch below the table]
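
The sketches below illustrate, in hedged form, the setup details quoted in the table; none of them reproduce the paper's released code. First, loading the backbones listed under Software Dependencies. The timm model identifier strings are assumptions and may not match the checkpoints used in the released code; only the `timm.create_model` and `open_clip.create_model_and_transforms` / `open_clip.get_tokenizer` calls themselves are standard library usage.

```python
import timm
import open_clip

# (i)-(v): ViT variants from timm. The identifier strings below are best guesses,
# not verified against the released code.
backbones = {
    "deit":   "deit_base_patch16_224",
    "dino":   "vit_base_patch16_224.dino",
    "dinov2": "vit_base_patch14_dinov2.lvd142m",
    "swin":   "swin_base_patch4_window7_224",
    "maxvit": "maxvit_small_tf_224",
}
models = {name: timm.create_model(arch, pretrained=True).eval()
          for name, arch in backbones.items()}

# (vi): CLIP ViT-B-16 from open_clip, used as the shared alignment target.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
```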
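Second, the decomposition itself. Algorithm 1 (REPDECOMPOSE) is not reproduced on this page; the toy model below only illustrates the bookkeeping the paper's decomposition relies on, namely that the final representation can be written as a sum of per-component contributions c_i. Real ViT components (attention heads, MLPs, normalization layers) need the handling described in the paper, which this sketch omits, and the class and variable names here are purely illustrative.

```python
import torch
import torch.nn as nn

class ToyResidualStack(nn.Module):
    """Toy residual stack: each block's residual-branch output is one contribution."""
    def __init__(self, dim=768, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])

    def forward(self, x):
        contributions = [x.clone()]      # c_0: the input embedding itself
        for block in self.blocks:
            delta = block(x)             # residual-branch output = component contribution
            contributions.append(delta)
            x = x + delta                # residual update
        return x, contributions

model = ToyResidualStack()
x = torch.randn(1, 768)
z, cs = model(x)
# Decomposition property: the final representation equals the sum of contributions.
assert torch.allclose(z, torch.stack(cs).sum(dim=0), atol=1e-5)
```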
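Third, a sketch of the aligner training quoted under Dataset Splits and Experiment Setup. The quoted text fixes only the optimizer (Adam), the learning rate (3 × 10^-4), λ = 1/768, and training for up to one epoch on the ImageNet-1k validation split. The loss form below (cosine alignment of the summed mapped contributions to the CLIP embedding, plus a λ-weighted penalty pulling each linear map toward a shared map) is an assumed reading of the quoted phrase about preserving component roles, not the paper's exact objective, and the component count and dimensions are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_COMPONENTS, DIM, CLIP_DIM = 150, 768, 512   # hypothetical sizes
aligners = nn.ModuleList([nn.Linear(DIM, CLIP_DIM, bias=False)
                          for _ in range(N_COMPONENTS)])
optimizer = torch.optim.Adam(aligners.parameters(), lr=3e-4)
lam = 1.0 / 768

def training_step(contributions, z_clip):
    # contributions: (N_COMPONENTS, batch, DIM) from the decomposition;
    # z_clip: (batch, CLIP_DIM) from the CLIP image encoder.
    mapped = torch.stack([f(c) for f, c in zip(aligners, contributions)])
    z_aligned = mapped.sum(dim=0)                                  # (batch, CLIP_DIM)
    align_loss = 1.0 - F.cosine_similarity(z_aligned, z_clip, dim=-1).mean()
    # Assumed regularizer: keep each per-component map close to a shared map.
    shared = torch.stack([f.weight for f in aligners]).mean(dim=0)
    reg = sum((f.weight - shared).pow(2).sum() for f in aligners)
    loss = align_loss + lam * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```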
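Finally, the retrieval step quoted under Open Datasets: sum the CLIP-aligned contributions f_i(c_i) of the k components most responsive to a feature, then rank ImageNet-1k validation images by cosine similarity to the CLIP text embedding of an instantiation s_p of that feature. The function and argument names are hypothetical, and the component-scoring step that produces `top_k_idx` is not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(aligned_contribs, top_k_idx, clip_model, tokenizer, text_prompt, n_images=10):
    """aligned_contribs: (n_components, n_val_images, clip_dim) tensor of f_i(c_i)
    over the ImageNet-1k validation split; top_k_idx: indices of the k components
    most responsive to the feature of interest."""
    # Sum the aligned contributions of the selected components for every image.
    summed = aligned_contribs[top_k_idx].sum(dim=0)        # (n_val_images, clip_dim)
    # Embed the textual instantiation s_p of the feature with the CLIP text encoder.
    tokens = tokenizer([text_prompt])
    text_emb = clip_model.encode_text(tokens)              # (1, clip_dim)
    # Rank images by cosine similarity and return the closest matches.
    sims = F.cosine_similarity(summed, text_emb, dim=-1)   # (n_val_images,)
    return sims.topk(n_images).indices
```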