Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge

Authors: Fawaz Sammani, Nikos Deligiannis

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We answer this question via an approach of textual concept-based explanations, showing their effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size and pretraining datasets. We explore those different aspects in relation to mutual knowledge, and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding zero-shot classification decisions with CLIP.
Researcher Affiliation | Collaboration | Fawaz Sammani, Nikos Deligiannis; ETRO Department, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium; imec, Kapeldreef 75, B-3001 Leuven, Belgium; fawaz.sammani@vub.be, ndeligia@etrovub.be
Pseudocode | No | The paper describes the method step-by-step in narrative form, but does not include a formally labeled 'Algorithm' or 'Pseudocode' block.
Open Source Code | Yes | https://github.com/fawazsammani/clip-interpret-mutual-knowledge
Open Datasets | Yes | We train these baselines on the full ImageNet training set, and report the Top-1 and Top-5 accuracy results on the ImageNet validation set in Table 5.
Dataset Splits | Yes | Models and Datasets: Our MI analysis considers a wide range of CLIP models varying in architecture, size and pretraining datasets, evaluated on the full ImageNet validation split [30].
Hardware Specification | Yes | All experiments are run on a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer [25]' and 'cosine schedule [34]' but does not provide specific version numbers for any libraries or frameworks used (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | The baselines are trained using the Adam optimizer [25] with a batch size of 64 and a learning rate of 1e-4 decayed using a cosine schedule [34] to 1e-5. We set q = 512. In Section 3.1, we set k = 500 and τ = 1. For analyzing the mutual information and its dynamics in Section 4.1 in the main paper, we set the number of concepts L = 5 and consider the top 3 textual concepts for each visual concept.
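For reference, the zero-shot classification pipeline that the Research Type and Dataset Splits rows refer to follows the standard CLIP recipe: encode the image and a set of class-name prompts, normalize both embeddings, and take a softmax over their similarities. The sketch below uses the open_clip library; the backbone name, class list, prompt template, and image path are illustrative assumptions, not values taken from the paper.

```python
import torch
import open_clip
from PIL import Image

# Minimal zero-shot classification sketch. Backbone, classes, prompt template,
# and image path are illustrative assumptions, not the paper's exact choices.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classes = ["golden retriever", "tabby cat", "sports car"]   # placeholder labels
text = tokenizer([f"a photo of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # CLIP scales the cosine similarities by its learned logit scale before the softmax.
    logits = model.logit_scale.exp() * image_features @ text_features.T
    probs = logits.softmax(dim=-1)

print(dict(zip(classes, probs.squeeze(0).tolist())))
```

The paper's analysis applies this kind of zero-shot procedure over the ImageNet validation split, with the dataset's class names serving as the text prompts.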
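The Experiment Setup row reports only the optimizer, batch size, learning-rate schedule, and a few hyperparameters (q, k, τ, L). Below is a minimal PyTorch sketch of that training configuration; the linear baseline head, the placeholder data, and the epoch count are assumptions, since the quoted setup does not specify them.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the reported baseline training configuration: Adam, batch size 64,
# lr 1e-4 cosine-annealed to 1e-5. Only these values come from the paper; the
# head, data, and epoch count are placeholders.
q = 512                                   # feature dimension reported in the setup
num_classes = 1000                        # ImageNet classes
head = nn.Linear(q, num_classes)          # hypothetical baseline head

# Placeholder tensors standing in for precomputed features and labels.
features = torch.randn(1024, q)
labels = torch.randint(0, num_classes, (1024,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

epochs = 10                               # assumed; not stated in the quoted setup
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * len(loader), eta_min=1e-5)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(head(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                  # per-step cosine decay from 1e-4 to 1e-5
```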