Interpreting CLIP's Image Representation via Text-Based Decomposition

Authors: Yossi Gandelsman, Alexei A Efros, Jacob Steinhardt

ICLR 2024

Reproducibility variables, results, and LLM responses:
Research Type: Experimental. We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models. We evaluate the methods on ImageNet-Segmentation (Guillaumin et al., 2014), which contains a subset of 4,276 images from the ImageNet validation set with annotated segmentations. Table 4 displays the results: our decomposition is more accurate than existing methods across all metrics.
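The decomposition described in the quoted abstract is additive, so the image-text similarity splits exactly into per-(layer, head, patch) terms. The snippet below is a minimal NumPy sketch of that idea, not the authors' released code; the array names and sizes (contribs, text_emb, L, H, T, D) are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the released code) of the additive decomposition:
# if the image representation is a sum of per-(layer, head, token) terms,
# then the image-text similarity splits into per-term contributions.
rng = np.random.default_rng(0)
L, H, T, D = 4, 16, 257, 768                  # illustrative: layers, heads, tokens, embed dim
contribs = rng.standard_normal((L, H, T, D))  # c[l, h, t]: direct contribution of head (l, h) at token t
text_emb = rng.standard_normal(D)             # CLIP text embedding of some description
text_emb /= np.linalg.norm(text_emb)

image_rep = contribs.sum(axis=(0, 1, 2))      # full representation = sum of all contributions
total_sim = image_rep @ text_emb              # overall image-text similarity
per_term = contribs @ text_emb                # (L, H, T): each term's share of that similarity
assert np.isclose(total_sim, per_term.sum())  # linearity: the shares add up exactly
```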
Researcher Affiliation: Academia. Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt, UC Berkeley, {yossigandelsman,aaefros,jsteinhardt}@berkeley.edu
Pseudocode: Yes. Algorithm 1: TEXTSPAN
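Algorithm 1 (TEXTSPAN) greedily selects, from a pool of text embeddings, directions that span an attention head's output space. Below is a hedged Python/NumPy sketch of that greedy procedure, not the authors' implementation; the exact variance criterion, the mean-centering, and the text pool are assumptions.

```python
import numpy as np

def textspan(C, R, m):
    """Greedy sketch of TEXTSPAN (Algorithm 1), under these assumptions:
    C: (N, d) direct contributions of one attention head over N images (mean-centered),
    R: (M, d) CLIP text embeddings of a pool of M candidate descriptions,
    m: number of descriptions to return.
    Each step keeps the text direction that explains the most remaining variance
    of C, then projects that direction out of both C and the text pool."""
    C = C.astype(float).copy()
    R = R.astype(float).copy()
    selected = []
    for _ in range(m):
        R_unit = R / (np.linalg.norm(R, axis=1, keepdims=True) + 1e-8)
        scores = ((C @ R_unit.T) ** 2).sum(axis=0)   # variance of C along each candidate direction
        j = int(np.argmax(scores))
        d = R_unit[j]
        selected.append(j)
        C -= np.outer(C @ d, d)                      # remove the chosen direction from the head outputs
        R -= np.outer(R @ d, d)                      # ...and from the remaining text pool
    return selected                                  # indices of the m spanning descriptions
```

In the paper's setup, C holds one head's per-image direct contributions in the joint embedding space and R holds embeddings of a large pool of general descriptions; the returned descriptions form the text basis used to characterize that head's role.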
Open Source Code: Yes. Project page and code: https://yossigandelsman.github.io/clip_decomposition/
Open Datasets: Yes. In our experiments, we compute means for each component over the ImageNet (IN) validation set and evaluate the drop in IN classification accuracy. We validate this idea on the Waterbirds dataset (Sagawa et al., 2019), which combines waterbird and landbird photographs from the CUB dataset (Welinder et al., 2010) with image backgrounds (water/land background) from the Places dataset (Zhou et al., 2016).
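The quoted passage refers to mean-ablation: a head's direct contribution is replaced by its mean over a reference set (here the ImageNet validation set) to remove the property it encodes, such as background cues on Waterbirds. The sketch below illustrates that operation under assumed array names and shapes; it is not the authors' code.

```python
import numpy as np

def mean_ablate(head_contribs, heads_to_ablate, mean_contribs):
    """Sketch of mean-ablation, assuming precomputed per-head direct contributions.
    head_contribs: (N, L, H, d) contribution of head (l, h) to image i's representation,
    mean_contribs: (L, H, d) per-head means computed over a reference set (e.g. ImageNet val),
    heads_to_ablate: iterable of (l, h) pairs, e.g. heads found to encode location/background."""
    ablated = head_contribs.copy()
    for (l, h) in heads_to_ablate:
        ablated[:, l, h] = mean_contribs[l, h]   # broadcast the mean over all N images
    return ablated.sum(axis=(1, 2))              # (N, d) repaired image representations
```

Zero-shot classification is then re-run on the repaired representations to measure the resulting change in accuracy (e.g. worst-group accuracy on Waterbirds, or the drop in IN accuracy).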
Dataset Splits: Yes. In our experiments, we compute means for each component over the ImageNet (IN) validation set and evaluate the drop in IN classification accuracy. We evaluate the methods on ImageNet-Segmentation (Guillaumin et al., 2014), which contains a subset of 4,276 images from the ImageNet validation set with annotated segmentations.
Hardware Specification: No. No specific hardware details (like GPU models, CPU types, or memory specifications) used for running the experiments are explicitly mentioned in the paper.
Software Dependencies: No. The paper mentions software frameworks and models like CLIP-ViT and OpenCLIP, but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA libraries.
Experiment Setup: Yes. Experimental setting. We apply TEXTSPAN to all the heads in the last 4 layers of CLIP ViT-L, which are responsible for most of the direct effects on the image representation (see Section 3.2). We consider a variety of output sizes m ∈ {10, 20, 30, 40, 50, 60}.
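The reported setting amounts to a sweep of TEXTSPAN over every head in the last four layers for several output sizes m. Below is a hedged sketch of that sweep, reusing the textspan function sketched above; the placeholder array names and sizes are assumptions kept small so the snippet runs quickly, not the dimensions used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_layers, num_heads, d = 200, 24, 16, 64          # small placeholder sizes for a quick run
head_contribs = rng.standard_normal((N, num_layers, num_heads, d))  # placeholder per-head contributions
text_pool = rng.standard_normal((500, d))               # placeholder pool of text embeddings

results = {}
for layer in range(num_layers - 4, num_layers):          # last 4 layers only (see Section 3.2)
    for head in range(num_heads):
        C = head_contribs[:, layer, head]                # (N, d) contributions of this head
        C = C - C.mean(axis=0)                           # mean-center before measuring variance
        for m in (10, 20, 30, 40, 50, 60):               # output sizes considered in the paper
            results[(layer, head, m)] = textspan(C, text_pool, m)
```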