A Concept-Based Explainability Framework for Large Multimodal Models

Authors: Jayneel Parekh, Pegah Khayatan, Mustafa Shukor, Alasdair Newson, Matthieu Cord

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We qualitatively and quantitatively evaluate the results of the learnt concepts. We show that the extracted multimodal concepts are useful to interpret representations of test samples. Finally, we evaluate the disentanglement between different concepts and the quality of grounding concepts visually and textually. Our implementation is publicly available.
Researcher Affiliation | Collaboration | (1) ISIR, Sorbonne Université, Paris, France; (2) Valeo.ai, Paris, France
Pseudocode | No | The paper describes methods in text and provides a figure (Fig. 1) as a visual summary, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation is publicly available. (Footnote 1: Project page and code: https://jayneelparekh.github.io/LMM_Concept_Explainability/)
Open Datasets | Yes | In the main paper, we focus on experiments with the DePALM model [45] that is trained for the captioning task on the COCO dataset [27].
Dataset Splits | Yes | The complete dataset consists of around 120,000 images for training, and 5,000 each for validation and testing, with 5 captions per image, following the Karpathy split.
Hardware Specification | Yes | Each experiment to analyze a token with a selected dictionary learning method is conducted on a single RTX5000 (24GB) / RTX6000 (48GB) / TITAN-RTX (24GB) GPU. ... Each experiment to extract a concept dictionary for LLaVA was conducted on a single A100 (80GB) GPU.
Software Dependencies | Yes | All the dictionary learning methods (PCA, KMeans, Semi-NMF) are implemented using scikit-learn [37]. ... The part of the code for representation extraction from the LMM is implemented using PyTorch [36]. ... For our analyses, we also employ the OPT-6.7B model [49] from Meta AI, released under the MIT license, and the CLIP model [39] from OpenAI, available under a custom usage license.
Experiment Setup | Yes | For uniformity and fairness, all the results in the main paper are reported with number of concepts K = 20 and for token representations from layer L = 31, the final layer before the unembedding layer. For Semi-NMF, we set λ = 1 throughout. We consider the 5 most activating samples in X_{k,MAS} for visual grounding for any u_k. For text grounding, we consider the top-15 tokens for T_k before applying the filtering described in Sec. 3.5.
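
The Software Dependencies and Experiment Setup rows above describe decomposing layer-31 token representations into K = 20 concepts with scikit-learn dictionary-learning methods and grounding each concept via its most activating samples. The sketch below is a minimal illustration of that setup under assumed data, shapes, and variable names; it is not the authors' released implementation.

```python
# Hypothetical sketch (not the authors' code): extracting K = 20 concept
# directions from token representations with scikit-learn.
# The synthetic data, shapes, and variable names are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

K = 20  # number of concepts used for all main-paper results

rng = np.random.default_rng(0)
# Placeholder for token representations from layer L = 31 of the LMM
# (one row per sample); in practice these are extracted with PyTorch.
Z = rng.standard_normal((5000, 4096)).astype(np.float32)

# PCA: concept dictionary = principal directions, activations = projections.
pca = PCA(n_components=K).fit(Z)
U_pca = pca.components_      # (K, d) concept dictionary
A_pca = pca.transform(Z)     # (N, K) per-sample concept activations

# KMeans: concept dictionary = cluster centroids.
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(Z)
U_km = km.cluster_centers_   # (K, d) concept dictionary

# Semi-NMF (used with lambda = 1 in the paper) has no stock scikit-learn
# estimator and is omitted here; see the authors' project page for it.

# Visual grounding as described above: for a concept k, keep the 5 samples
# that activate it most strongly (the "most activating samples" X_{k,MAS}).
k = 0
top5_idx = np.argsort(A_pca[:, k])[-5:][::-1]
```

Text grounding (the top-15 tokens for T_k) additionally maps each concept vector through the model's unembedding layer to vocabulary tokens, which requires the LMM itself and is therefore not reproduced in this sketch.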