Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
Authors: Fawaz Sammani, Nikos Deligiannis
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We answer this question via an approach of textual concept-based explanations, showing their effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size and pretraining datasets. We explore those different aspects in relation to mutual knowledge, and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding zero-shot classification decisions with CLIP. |
| Researcher Affiliation | Collaboration | Fawaz Sammani, Nikos Deligiannis ETRO Department, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium imec, Kapeldreef 75, B-3001 Leuven, Belgium fawaz.sammani@vub.be, ndeligia@etrovub.be |
| Pseudocode | No | The paper describes the method step-by-step in narrative form, but does not include a formally labeled 'Algorithm' or 'Pseudocode' block. |
| Open Source Code | Yes | https://github.com/fawazsammani/clip-interpret-mutual-knowledge |
| Open Datasets | Yes | We train these baselines on the full ImageNet training set, and report the Top-1 and Top-5 accuracy results on the ImageNet validation set in Table 5. |
| Dataset Splits | Yes | Models and Datasets: Our MI analysis considers a wide range of CLIP models varying in architecture, size and pretraining datasets, evaluated on the full ImageNet validation split [30]. |
| Hardware Specification | Yes | All experiments are run on a single NVIDIA RTX3090 GPU. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer [25]' and 'cosine schedule [34]' but does not provide specific version numbers for any libraries or frameworks used (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | The baselines are trained using the Adam optimizer [25] with a batch size of 64 and a learning rate of 1e-4 decayed using a cosine schedule [34] to 1e-5. We set q = 512. In section 3.1, we set k = 500 and τ = 1. For analyzing the mutual information and its dynamics in Section 4.1 in the main paper, we set the number of concepts L = 5 and consider the top 3 textual concepts for each visual concept. |
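The reported experiment setup can be collected into a single configuration sketch. This is a minimal illustration assuming a standard cosine learning-rate decay between the stated bounds; the paper cites [34] for its schedule, and the exact variant (e.g., warmup or restarts) is not specified, so the `cosine_lr` helper below is an assumption, not the authors' implementation.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative; they do not come from the paper's code.
CONFIG = {
    "optimizer": "Adam",       # Adam optimizer [25]
    "batch_size": 64,
    "lr_max": 1e-4,            # initial learning rate
    "lr_min": 1e-5,            # final rate after cosine decay [34]
    "q": 512,
    "k": 500,                  # Section 3.1
    "tau": 1.0,                # temperature, Section 3.1
    "num_concepts_L": 5,       # MI analysis, Section 4.1
    "top_textual_per_visual": 3,
}

def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float) -> float:
    """Generic cosine decay from lr_max down to lr_min (assumed variant)."""
    progress = step / max(1, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

Under this sketch, the rate starts at `lr_max` (1e-4) at step 0 and reaches `lr_min` (1e-5) at the final step, matching the decay endpoints quoted in the table.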