Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Authors: Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, Ruslan Salakhutdinov, Louis-Philippe Morency
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimations are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies engaging with domain experts in pathology, mood prediction, and robotic perception where our framework helps to recommend strong multimodal models for each application. |
| Researcher Affiliation | Academia | Paul Pu Liang1, Yun Cheng1,2, Xiang Fan1,3, Chun Kai Ling1,8, Suzanne Nie1, Richard J. Chen4,5, Zihao Deng6, Nicholas Allen7, Randy Auerbach8, Faisal Mahmood4,5, Ruslan Salakhutdinov1, Louis-Philippe Morency1 1CMU, 2Princeton University, 3UW, 4Harvard Medical School, 5Brigham and Women's Hospital, 6University of Pennsylvania, 7University of Oregon, 8Columbia University |
| Pseudocode | Yes | Algorithm 1 BATCH algorithm. |
| Open Source Code | Yes | Finally, we make public a suite of trained models across 10 model families and 30 datasets to accelerate future analysis of multimodal interactions at https://github.com/pliang279/PID. |
| Open Datasets | Yes | We use a large collection of real-world datasets in MultiBench [60] which test multimodal fusion of different input signals (including images, video, audio, text, time-series, sets, and tables) for different tasks (predicting humor, sentiment, emotions, mortality rate, ICD-9 codes, image-captions, human activities, digits, and design interfaces). |
| Dataset Splits | Yes | For each dataset, we train a suite of models on the train set Dtrain and apply it to the validation set Dval, yielding a predicted dataset Dpred = {(x1, x2, ŷ) ∈ Dval}. (An illustrative sketch of this split protocol follows the table.) |
| Hardware Specification | No | The paper mentions "NVIDIA's GPU support" in the acknowledgements, but it does not specify any particular GPU model, CPU, memory, or other specific hardware configurations used for running the experiments. |
| Software Dependencies | Yes | We implemented (5) using CVXPY [25, 1]. The transformation from the max-entropy objective (16) to (5) ensures adherence to disciplined convex programs [40], thus allowing CVXPY to recognize it as a convex program. All 3 conic solvers, ECOS [27], SCS [75], and MOSEK [6] were used, with ECOS and SCS being default solvers packaged with CVXPY. Our experience is that MOSEK is the fastest and most stable solver, working out of the box without any tuning. However, it comes with the downside of being commercial. For smaller problems, ECOS and SCS work just fine. (An illustrative CVXPY sketch follows the table.) |
| Experiment Setup | Yes | For the neural network in Algorithm 1, we use a 3-layer feedforward neural network with a hidden dimension of 32. We train the network for 10 epochs using the Adam optimizer with a batch size of 256 and learning rate of 0.001. (An illustrative PyTorch sketch of this configuration follows the table.) |
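The Dataset Splits row quotes a protocol in which a model trained on Dtrain is run over Dval and its predictions ŷ replace the true labels, yielding Dpred = {(x1, x2, ŷ)}. The sketch below is an illustrative reconstruction of that step, not the released code: it assumes a trained `model` that takes two modalities `(x1, x2)` and a `val_loader` over Dval; both names are placeholders.

```python
import torch

@torch.no_grad()
def build_pred_dataset(model, val_loader):
    """Apply a trained model to D_val and collect (x1, x2, y_hat) triples."""
    model.eval()
    pred_dataset = []
    for x1, x2, _ in val_loader:  # ground-truth labels in D_val are discarded
        y_hat = model(x1, x2).argmax(dim=-1)
        pred_dataset.extend(zip(x1, x2, y_hat))
    return pred_dataset
```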
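The Software Dependencies row describes solving a max-entropy convex program with CVXPY using the ECOS, SCS, or MOSEK conic solvers. The sketch below is a generic max-entropy program over a small joint distribution with marginal constraints; it is not the paper's exact objective (5)/(16), and the toy distribution `p`, the binary supports, and all variable names are assumptions for illustration.

```python
import itertools
import numpy as np
import cvxpy as cp

# Toy observed joint distribution p(x1, x2, y) over binary variables (assumed).
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()

# Flatten q(x1, x2, y) into a vector; idx maps a state (x1, x2, y) to its index.
states = list(itertools.product(range(2), range(2), range(2)))
idx = {s: i for i, s in enumerate(states)}
q = cp.Variable(len(states), nonneg=True)

constraints = [cp.sum(q) == 1]
for x1 in range(2):
    for y in range(2):
        # Preserve the p(x1, y) marginal (sum over x2).
        constraints.append(q[idx[(x1, 0, y)]] + q[idx[(x1, 1, y)]] == p[x1, :, y].sum())
for x2 in range(2):
    for y in range(2):
        # Preserve the p(x2, y) marginal (sum over x1).
        constraints.append(q[idx[(0, x2, y)]] + q[idx[(1, x2, y)]] == p[:, x2, y].sum())

# Maximize the Shannon entropy of q; cp.entr(t) = -t * log(t), so the sum is concave.
problem = cp.Problem(cp.Maximize(cp.sum(cp.entr(q))), constraints)

# SCS and ECOS ship with CVXPY; MOSEK is a commercial solver.
problem.solve(solver=cp.SCS)
print("max-entropy value:", problem.value)
```

Because the objective uses `cp.entr`, the program stays within disciplined convex programming rules and CVXPY can hand it to any exponential-cone-capable solver (SCS, ECOS, or MOSEK) without reformulation.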
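The Experiment Setup row specifies a 3-layer feedforward network with hidden dimension 32, trained for 10 epochs with Adam, batch size 256, and learning rate 0.001. The PyTorch sketch below mirrors that configuration on synthetic placeholder data; the input dimension, number of classes, and loss function are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

in_dim, num_classes = 20, 2  # assumed dimensions for illustration

# 3-layer feedforward network with hidden dimension 32.
model = nn.Sequential(
    nn.Linear(in_dim, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, num_classes),
)

# Synthetic placeholder data standing in for a training split.
x = torch.randn(4096, in_dim)
y = torch.randint(0, num_classes, (4096,))
loader = DataLoader(TensorDataset(x, y), batch_size=256, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Train for 10 epochs with Adam, batch size 256, learning rate 0.001.
for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```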