Interpreting CLIP's Image Representation via Text-Based Decomposition

Authors: Yossi Gandelsman, Alexei A Efros, Jacob Steinhardt

ICLR 2024

Reproducibility variables, results, and LLM responses:
Research Type: Experimental. We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models. We evaluate the methods on ImageNet-Segmentation (Guillaumin et al., 2014), which contains a subset of 4,276 images from the ImageNet validation set with annotated segmentations. Table 4 displays the results: our decomposition is more accurate than existing methods across all metrics.
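The decomposition described in the quoted abstract is additive, so the image-text similarity splits exactly into per-(layer, head, patch) terms. The snippet below is a minimal NumPy sketch of that idea, not the authors' released code; the array names and sizes (contribs, text_emb, L, H, T, D) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the released code) of the additive decomposition:
# if the image representation is a sum of per-(layer, head, token) terms,
# then the image-text similarity splits into per-term contributions.
rng = np.random.default_rng(0)
L, H, T, D = 4, 16, 257, 768                  # illustrative: layers, heads, tokens, embed dim
contribs = rng.standard_normal((L, H, T, D))  # c[l, h, t]: direct contribution of head (l, h) at token t
text_emb = rng.standard_normal(D)             # CLIP text embedding of some description
text_emb /= np.linalg.norm(text_emb)

image_rep = contribs.sum(axis=(0, 1, 2))      # full representation = sum of all contributions
total_sim = image_rep @ text_emb              # overall image-text similarity
per_term = contribs @ text_emb                # (L, H, T): each term's share of that similarity
assert np.isclose(total_sim, per_term.sum())  # linearity: the shares add up exactly
```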
Researcher Affiliation: Academia. Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt, UC Berkeley, {yossigandelsman,aaefros,jsteinhardt}@berkeley.edu
Pseudocode: Yes. Algorithm 1: TEXTSPAN
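Algorithm 1 (TEXTSPAN) greedily selects, from a pool of text embeddings, directions that span an attention head's output space. Below is a hedged Python/NumPy sketch of that greedy procedure, not the authors' implementation; the exact variance criterion, the mean-centering, and the text pool are assumptions.

```python
import numpy as np

def textspan(C, R, m):
    """Greedy sketch of TEXTSPAN (Algorithm 1), under these assumptions:
    C: (N, d) direct contributions of one attention head over N images (mean-centered),
    R: (M, d) CLIP text embeddings of a pool of M candidate descriptions,
    m: number of descriptions to return.
    Each step keeps the text direction that explains the most remaining variance
    of C, then projects that direction out of both C and the text pool."""
    C = C.astype(float).copy()
    R = R.astype(float).copy()
    selected = []
    for _ in range(m):
        R_unit = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-8)
        scores = ((C @ R_unit.T) ** 2).sum(axis=0)   # variance of C along each candidate direction
        j = int(np.argmax(scores))
        d = R_unit[j]
        selected.append(j)
        C -= np.outer(C @ d, d)                      # remove the chosen direction from the head outputs
        R -= np.outer(R @ d, d)                      # ...and from the remaining text pool
    return selected                                  # indices of the m spanning descriptions
```

In the paper's setup, C holds one head's per-image direct contributions in the joint embedding space and R holds embeddings of a large pool of general descriptions; the returned descriptions form the text basis used to characterize that head's role.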
Open Source Code: Yes. Project page and code: https://yossigandelsman.github.io/clip_decomposition/
Open Datasets: Yes. In our experiments, we compute means for each component over the ImageNet (IN) validation set and evaluate the drop in IN classification accuracy. We validate this idea on the Waterbirds dataset (Sagawa et al., 2019), which combines waterbird and landbird photographs from the CUB dataset (Welinder et al., 2010) with image backgrounds (water/land background) from the Places dataset (Zhou et al., 2016).
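The quoted passage refers to mean-ablation: a head's direct contribution is replaced by its mean over a reference set (here the ImageNet validation set) to remove the property it encodes, such as background cues on Waterbirds. The sketch below illustrates that operation under assumed array names and shapes; it is not the authors' code.

```python
import numpy as np

def mean_ablate(head_contribs, heads_to_ablate, mean_contribs):
    """Sketch of mean-ablation, assuming precomputed per-head direct contributions.
    head_contribs: (N, L, H, d) contribution of head (l, h) to image i's representation,
    mean_contribs: (L, H, d) per-head means computed over a reference set (e.g. ImageNet val),
    heads_to_ablate: iterable of (l, h) pairs, e.g. heads found to encode location/background."""
    ablated = head_contribs.copy()
    for (l, h) in heads_to_ablate:
        ablated[:, l, h] = mean_contribs[l, h]   # broadcast the mean over all N images
    return ablated.sum(axis=(1, 2))              # (N, d) repaired image representations
```

Zero-shot classification is then re-run on the repaired representations to measure the resulting change in accuracy (e.g. worst-group accuracy on Waterbirds, or the drop in IN accuracy).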
Dataset Splits: Yes. In our experiments, we compute means for each component over the ImageNet (IN) validation set and evaluate the drop in IN classification accuracy. We evaluate the methods on ImageNet-Segmentation (Guillaumin et al., 2014), which contains a subset of 4,276 images from the ImageNet validation set with annotated segmentations.
Hardware Specification: No. No specific hardware details (like GPU models, CPU types, or memory specifications) used for running the experiments are explicitly mentioned in the paper.
Software Dependencies: No. The paper mentions software frameworks and models like CLIP-ViT and OpenCLIP, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA libraries.
Experiment Setup: Yes. Experimental setting. We apply TEXTSPAN to all the heads in the last 4 layers of CLIP ViT-L, which are responsible for most of the direct effects on the image representation (see Section 3.2). We consider a variety of output sizes m ∈ {10, 20, 30, 40, 50, 60}.
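The reported setting amounts to a sweep of TEXTSPAN over every head in the last four layers for several output sizes m. Below is a hedged sketch of that sweep, reusing the textspan function sketched above; the placeholder array names and sizes are assumptions kept small so the snippet runs quickly, not the dimensions used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_layers, num_heads, d = 200, 24, 16, 64          # small placeholder sizes for a quick run
head_contribs = rng.standard_normal((N, num_layers, num_heads, d))  # placeholder per-head contributions
text_pool = rng.standard_normal((500, d))               # placeholder pool of text embeddings

results = {}
for layer in range(num_layers - 4, num_layers):          # last 4 layers only (see Section 3.2)
    for head in range(num_heads):
        C = head_contribs[:, layer, head]                # (N, d) contributions of this head
        C = C - C.mean(axis=0)                           # mean-center before measuring variance
        for m in (10, 20, 30, 40, 50, 60):               # output sizes considered in the paper
            results[(layer, head, m)] = textspan(C, text_pool, m)
```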