Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge

Authors: Fawaz Sammani, Nikos Deligiannis

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We answer this question via an approach of textual concept-based explanations, showing their effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size and pretraining datasets. We explore those different aspects in relation to mutual knowledge, and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding zero-shot classification decisions with CLIP.
Researcher Affiliation | Collaboration | Fawaz Sammani, Nikos Deligiannis; ETRO Department, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium; imec, Kapeldreef 75, B-3001 Leuven, Belgium; fawaz.sammani@vub.be, ndeligia@etrovub.be
Pseudocode | No | The paper describes the method step-by-step in narrative form, but does not include a formally labeled 'Algorithm' or 'Pseudocode' block.
Open Source Code | Yes | https://github.com/fawazsammani/clip-interpret-mutual-knowledge
Open Datasets | Yes | We train these baselines on the full ImageNet training set, and report the Top-1 and Top-5 accuracy results on the ImageNet validation set in Table 5.
Dataset Splits | Yes | Models and Datasets: Our MI analysis considers a wide range of CLIP models varying in architecture, size and pretraining datasets, evaluated on the full ImageNet validation split [30].
Hardware Specification | Yes | All experiments are run on a single NVIDIA RTX 3090 GPU.
Software Dependencies | No | The paper mentions software components like 'Adam optimizer [25]' and 'cosine schedule [34]' but does not provide specific version numbers for any libraries or frameworks used (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | The baselines are trained using the Adam optimizer [25] with a batch size of 64 and a learning rate of 1e-4 decayed using a cosine schedule [34] to 1e-5. We set q = 512. In Section 3.1, we set k = 500 and τ = 1. For analyzing the mutual information and its dynamics in Section 4.1 in the main paper, we set the number of concepts L = 5 and consider the top 3 textual concepts for each visual concept.
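For reference, the zero-shot classification pipeline that the Research Type and Dataset Splits rows refer to follows the standard CLIP recipe: encode the image and a set of class-name prompts, normalize both embeddings, and take a softmax over their similarities. The sketch below uses the open_clip library; the backbone name, class list, prompt template, and image path are illustrative assumptions, not values taken from the paper.

```python
import torch
import open_clip
from PIL import Image

# Minimal zero-shot classification sketch. Backbone, classes, prompt template,
# and image path are illustrative assumptions, not the paper's exact choices.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

classes = ["golden retriever", "tabby cat", "sports car"]   # placeholder labels
text = tokenizer([f"a photo of a {c}" for c in classes])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # CLIP scales the cosine similarities by its learned logit scale before the softmax.
    logits = model.logit_scale.exp() * image_features @ text_features.T
    probs = logits.softmax(dim=-1)

print(dict(zip(classes, probs.squeeze(0).tolist())))
```

The paper's analysis applies this kind of zero-shot procedure over the ImageNet validation split, with the dataset's class names serving as the text prompts.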
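The Experiment Setup row reports only the optimizer, batch size, learning-rate schedule, and a few hyperparameters (q, k, τ, L). Below is a minimal PyTorch sketch of that training configuration; the linear baseline head, the placeholder data, and the epoch count are assumptions, since the quoted setup does not specify them.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Sketch of the reported baseline training configuration: Adam, batch size 64,
# lr 1e-4 cosine-annealed to 1e-5. Only these values come from the paper; the
# head, data, and epoch count are placeholders.
q = 512                                   # feature dimension reported in the setup
num_classes = 1000                        # ImageNet classes
head = nn.Linear(q, num_classes)          # hypothetical baseline head

# Placeholder tensors standing in for precomputed features and labels.
features = torch.randn(1024, q)
labels = torch.randint(0, num_classes, (1024,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

epochs = 10                               # assumed; not stated in the quoted setup
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * len(loader), eta_min=1e-5)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(head(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                  # per-step cosine decay from 1e-4 to 1e-5
```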