A Concept-Based Explainability Framework for Large Multimodal Models

Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. Our implementation is publicly available.
Researcher Affiliation | Collaboration | (1) ISIR, Sorbonne Université, Paris, France; (2) Valeo.ai, Paris, France
Pseudocode | No | The paper describes methods in text and provides a figure (Fig. 1) as a visual summary, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation is publicly available. (Footnote 1: Project page and code: https://jayneelparekh.github.io/LMM_Concept_Explainability/)
Open Datasets | Yes | In the main paper, we focus on experiments with the DePALM model [45] that is trained for the captioning task on the COCO dataset [27].
Dataset Splits | Yes | The complete dataset consists of around 120,000 images for training, and 5,000 each for validation and testing, with 5 captions per image, following the Karpathy split.
Hardware Specification | Yes | Each experiment to analyze a token with a selected dictionary learning method is conducted on a single RTX5000 (24GB) / RTX6000 (48GB) / TITAN-RTX (24GB) GPU. ... Each experiment to extract a concept dictionary for LLaVA was conducted on a single A100 (80GB) GPU.
Software Dependencies | Yes | All the dictionary learning methods (PCA, KMeans, Semi-NMF) are implemented using scikit-learn [37]. ... The part of the code for representation extraction from the LMM is implemented using PyTorch [36]. ... For our analyses, we also employ the OPT-6.7B model [49] from Meta AI, released under the MIT license, and the CLIP model [39] from OpenAI, available under a custom usage license.
Experiment Setup | Yes | For uniformity and fairness, all the results in the main paper are reported with number of concepts K = 20 and for token representations from layer L = 31, the final layer before the unembedding layer. For Semi-NMF, we set λ = 1 throughout. We consider the 5 most activating samples in X_{k,MAS} for visual grounding for any u_k. For text grounding, we consider the top-15 tokens for T_k before applying the filtering described in Sec. 3.5.
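
The Software Dependencies and Experiment Setup rows above describe decomposing layer-31 token representations into K = 20 concepts with scikit-learn dictionary-learning methods and grounding each concept via its most activating samples. The sketch below is a minimal illustration of that setup under assumed data, shapes, and variable names; it is not the authors' released implementation.

```python
# Hypothetical sketch (not the authors' code): extracting K = 20 concept
# directions from token representations with scikit-learn.
# The synthetic data, shapes, and variable names are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

K = 20  # number of concepts used for all main-paper results

rng = np.random.default_rng(0)
# Placeholder for token representations from layer L = 31 of the LMM
# (one row per sample); in practice these are extracted with PyTorch.
Z = rng.standard_normal((5000, 4096)).astype(np.float32)

# PCA: concept dictionary = principal directions, activations = projections.
pca = PCA(n_components=K).fit(Z)
U_pca = pca.components_      # (K, d) concept dictionary
A_pca = pca.transform(Z)     # (N, K) per-sample concept activations

# KMeans: concept dictionary = cluster centroids.
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Z)
U_km = km.cluster_centers_   # (K, d) concept dictionary

# Semi-NMF (used with lambda = 1 in the paper) has no stock scikit-learn
# estimator and is omitted here; see the authors' project page for it.

# Visual grounding as described above: for a concept k, keep the 5 samples
# that activate it most strongly (the "most activating samples" X_{k,MAS}).
k = 0
top5_idx = np.argsort(A_pca[:, k])[-5:][::-1]
```

Text grounding (the top-15 tokens for T_k) additionally maps each concept vector through the model's unembedding layer to vocabulary tokens, which requires the LMM itself and is therefore not reproduced in this sketch.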