MultiViz: Towards Visualizing and Understanding Multimodal Models

Authors: Paul Pu Liang, Yiwei Lyu, Gunjan Chhablani, Nihal Jain, Zihao Deng, Xingbo Wang, Louis-Philippe Morency, Ruslan Salakhutdinov

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments on 8 trained models across 6 real-world tasks, we show that the complementary stages in MULTIVIZ together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models.
Researcher Affiliation | Academia | Paul Pu Liang (1), Yiwei Lyu (2), Gunjan Chhablani (3), Nihal Jain (1), Zihao Deng (1), Xingbo Wang (4), Louis-Philippe Morency (1), Ruslan Salakhutdinov (1); (1) Carnegie Mellon University, (2) University of Michigan, (3) Georgia Tech, (4) HKUST
Pseudocode | Yes | We summarize these proposed approaches for understanding each step of the multimodal process in Table 4, and show the overall pipeline in Algorithm 1 and Figure 1.
Open Source Code | Yes | MULTIVIZ is publicly available, will be regularly updated with new interpretation tools and metrics, and welcomes inputs from the community. [...] MULTIVIZ datasets, models, and code are at https://github.com/pliang279/MultiViz.
Open Datasets | Yes | Setup: We use a large suite of datasets from MultiBench (Liang et al., 2021a) which span real-world fusion (Zadeh et al., 2018; Arevalo et al., 2017; Johnson et al., 2016), retrieval (Plummer et al., 2015), and QA (Johnson et al., 2017; Goyal et al., 2017) tasks.
Dataset Splits | Yes | Train, validation, and test splits: Each dataset contains several videos, and each video is further split into short segments (roughly 10-20 seconds) that are annotated. We split the data at the level of videos so that segments from the same video will not appear across the train, valid, and test splits. This enables us to train user-independent models instead of having a model potentially memorize the average affective state of a user. There are a total of 16,265, 1,869, and 4,643 segments in the train, valid, and test datasets respectively, for a total of 22,777 data points. (See the video-level split sketch after the table.)
Hardware Specification | No | Computational resources: Preparations for all experiments (i.e., generating the necessary visualizations for the points for each dataset) are done on a private server with 2 GPUs. The preparation time for the model simulation experiment using 2 GPUs is about 12 hours for VQA, 1 hour for MM-IMDb, and 2 hours for CMU-MOSEI. For the representation interpretation experiment, we generated all visualizations for the VQA data points in the experiment in about 3 hours on 1 GPU.
Software Dependencies | No | One additional major contribution of our work is that we designed a code framework in Python for easy analysis, interpretation, and visualization of models on multimodal datasets with only a few lines of code.
Experiment Setup | Yes | Under each of these active learning settings, we finetune the last layer of LXMERT with the N selected points from the U set for one epoch (batch size 32, learning rate tuned to the best performance), and the result is evaluated on the T set. (See the last-layer fine-tuning sketch after the table.)
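
The Dataset Splits row above describes splitting the data at the level of videos so that segments from the same video never cross the train, valid, and test sets. The sketch below shows one way such a group-aware split could be implemented; the `segments` structure, `video_id` key, split fractions, and seed are illustrative assumptions, not the authors' code.

```python
# Sketch of a video-level split: segments from the same video stay in one split.
# The `segments` format, split fractions, and seed are assumptions for illustration.
import random
from collections import defaultdict

def video_level_split(segments, train_frac=0.7, valid_frac=0.1, seed=0):
    """segments: list of dicts, each carrying a 'video_id' key."""
    by_video = defaultdict(list)
    for seg in segments:
        by_video[seg["video_id"]].append(seg)

    video_ids = sorted(by_video)              # deterministic ordering before shuffling
    random.Random(seed).shuffle(video_ids)

    n_train = int(train_frac * len(video_ids))
    n_valid = int(valid_frac * len(video_ids))
    train_ids = video_ids[:n_train]
    valid_ids = video_ids[n_train:n_train + n_valid]
    test_ids = video_ids[n_train + n_valid:]

    def gather(ids):
        return [seg for vid in ids for seg in by_video[vid]]

    return gather(train_ids), gather(valid_ids), gather(test_ids)
```

Because the split is drawn over video ids rather than over segments, the resulting segment counts (e.g., the 16,265 / 1,869 / 4,643 split reported in the paper) depend on how many segments each video contains.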
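
The Experiment Setup row above fine-tunes only the last layer of LXMERT on the N actively selected points for one epoch with batch size 32. Below is a minimal PyTorch sketch of that freeze-then-finetune pattern; the `answer_head` attribute, `selected_set` dataset, `collate_fn`, and learning rate are assumptions for illustration and do not reflect the authors' implementation.

```python
# Sketch (not the authors' code): freeze an LXMERT-style model except its final
# answer head, then train for one epoch on the actively selected points.
import torch
from torch.utils.data import DataLoader

def finetune_last_layer(model, selected_set, collate_fn, lr=5e-5, device="cuda"):
    # Freeze every parameter, then unfreeze only the assumed final head.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.answer_head.parameters():   # assumed name of the last layer
        p.requires_grad = True

    loader = DataLoader(selected_set, batch_size=32, shuffle=True,
                        collate_fn=collate_fn)
    optimizer = torch.optim.Adam(model.answer_head.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    model.to(device).train()
    for inputs, labels in loader:              # exactly one epoch over the N points
        inputs = {k: v.to(device) for k, v in inputs.items()}
        logits = model(**inputs)               # assumed to return answer logits
        loss = criterion(logits, labels.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```

The paper tunes the learning rate to the best performance, so the value above is only a placeholder; evaluation on the held-out T set would reuse the same loop in `eval` mode without gradient updates.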