Interpreting CLIP's Image Representation via Text-Based Decomposition
Authors: Yossi Gandelsman, Alexei A Efros, Jacob Steinhardt
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models. We evaluate the methods on ImageNet-segmentation (Guillaumin et al., 2014), which contains a subset of 4,276 images from the ImageNet validation set with annotated segmentations. Table 4 displays the results: our decomposition is more accurate than existing methods across all metrics. |
| Researcher Affiliation | Academia | Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt, UC Berkeley, {yossi_gandelsman,aaefros,jsteinhardt}@berkeley.edu |
| Pseudocode | Yes | Algorithm 1: TEXTSPAN |
| Open Source Code | Yes | Project page and code: https://yossigandelsman.github.io/clip_decomposition/ |
| Open Datasets | Yes | In our experiments, we compute means for each component over the ImageNet (IN) validation set and evaluate the drop in IN classification accuracy. We validate this idea on the Waterbirds dataset (Sagawa et al., 2019), which combines waterbird and landbird photographs from the CUB dataset (Welinder et al., 2010) with image backgrounds (water/land background) from the Places dataset (Zhou et al., 2016). |
| Dataset Splits | Yes | In our experiments, we compute means for each component over the ImageNet (IN) validation set and evaluate the drop in IN classification accuracy. We evaluate the methods on ImageNet-segmentation (Guillaumin et al., 2014), which contains a subset of 4,276 images from the ImageNet validation set with annotated segmentations. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory specifications) used for running the experiments are explicitly mentioned in the paper. |
| Software Dependencies | No | The paper mentions software frameworks and models like CLIP-ViT and OpenCLIP, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA libraries. |
| Experiment Setup | Yes | Experimental setting. We apply TEXTSPAN to all the heads in the last 4 layers of CLIP ViT-L, which are responsible for most of the direct effects on the image representation (see Section 3.2). We consider a variety of output sizes m ∈ {10, 20, 30, 40, 50, 60}. (Hedged sketches of the component decomposition and of the greedy TEXTSPAN selection appear after this table.) |
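
The decomposition quoted in the "Research Type" row lends itself to a direct numerical check. Below is a minimal sketch, assuming per-image contributions of each attention head, each MLP block, and the class-token path have already been extracted and projected into the joint text-image space; the tensor names and shapes are illustrative, not the authors' code. The final representation is recovered as the sum of the components, and a head's direct effect can be probed by mean-ablation, i.e. replacing its contribution with its mean over the ImageNet validation set, as described in the "Open Datasets" row.

```python
import numpy as np

def reconstruct_representation(head_contrib, mlp_contrib, cls_contrib):
    """Sum the per-component contributions back into the image representation.

    head_contrib: [L, H, d] contribution of each attention head at each layer
    mlp_contrib:  [L, d]    contribution of each MLP block
    cls_contrib:  [d]       contribution of the class-token (direct) path
    All are assumed to be already projected into the joint text-image space.
    """
    return head_contrib.sum(axis=(0, 1)) + mlp_contrib.sum(axis=0) + cls_contrib

def mean_ablate_head(head_contrib, head_means, layer, head):
    """Replace one head's contribution with its dataset mean (mean-ablation).

    The drop in zero-shot classification accuracy after this replacement is
    used as a measure of that head's direct effect on the representation.
    """
    ablated = head_contrib.copy()
    ablated[layer, head] = head_means[layer, head]
    return ablated
```

In the paper, the component means are computed over the ImageNet validation set, and heads whose ablation barely changes accuracy are judged to have small direct effects.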
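
The paper's Algorithm 1 (TEXTSPAN) finds, for each head, a small set of text directions that span its output space. The following is a minimal sketch of that greedy selection under simple assumptions: `head_outputs` holds one head's per-image contributions and `text_embs` a pool of candidate text description embeddings, both in the joint space; at each step the candidate explaining the most remaining variance is kept, and its direction is projected out of both matrices. Variable names and any preprocessing (e.g. centering) are illustrative rather than the authors' exact implementation.

```python
import numpy as np

def textspan(head_outputs, text_embs, m=10):
    """Greedily pick m text directions that explain a head's output variance.

    head_outputs: [N, d] per-image contributions of a single attention head
    text_embs:    [M, d] embeddings of a large pool of text descriptions
    m:            number of text directions to return (the paper sweeps 10-60)
    """
    A = head_outputs.astype(np.float64).copy()
    R = text_embs.astype(np.float64).copy()
    selected = []
    for _ in range(m):
        # Score every candidate by the squared projection of the head outputs
        # onto it, i.e. the variance explained along that text direction.
        scores = ((A @ R.T) ** 2).sum(axis=0)
        j = int(scores.argmax())
        selected.append(j)
        # Project the chosen direction out of both the head outputs and the
        # candidate pool, so later picks account only for remaining variance.
        r = R[j] / (np.linalg.norm(R[j]) + 1e-12)
        A -= np.outer(A @ r, r)
        R -= np.outer(R @ r, r)
    return selected
```

Running such a routine per head over the last four layers of CLIP ViT-L, with m between 10 and 60, mirrors the setting described in the "Experiment Setup" row.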