Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
Authors: Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, Cheston Tan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Hence, to provide insights, we design QUAG (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design QUAG-attention, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models' abilities to learn highly-coupled multimodal representations. Hence, we design the CLAVI (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. |
| Researcher Affiliation | Academia | 1 Centre for Frontier AI Research, Agency for Science, Technology & Research, Singapore 2 Institute of High Performance Computing, Agency for Science, Technology & Research, Singapore 3 Institute for Infocomm Research, Agency for Science, Technology & Research, Singapore. Correspondence to: Ishaan Singh Rawal <rawal_ishaan_singh@cfar.a-star.edu.sg>. |
| Pseudocode | Yes | We provide the code in the Appendix A.4. Since we will be applying the QUAG operator successively on all the layers of M, for brevity, we denote Φ(M, S): ∀ i ∈ [1, …, n], Aᵢ ← ϕ(Aᵢ, S). Note that QUAG is light-weight, non-parametric, requires no finetuning and operates at inference time for combined dataset-model analysis. [...] A.4 provides code snippets under `def self_attention` and `def quag_attention` (an illustrative sketch of the quadrant-averaging operation is given below the table). |
| Open Source Code | Yes | (project page: https: //dissect-videoqa.github.io). [...] Incorporating QUAG into the existing model pipeline is straightforward and we provide the code in the Appendix A.4. |
| Open Datasets | Yes | CLAVI is curated by leveraging Charades-STA (https://prior.allenai.org/projects/data/charades/license.txt) (Gao et al., 2017) |
| Dataset Splits | Yes | Using the dataset generation strategy as described in the main paper, we sample 40,000 data points (24,000: training, 8,000: validation, 8,000: testing). |
| Hardware Specification | Yes | A.6. Experiment Details for Real World Data: All our experiments were performed on 4 NVIDIA A5000 GPUs. |
| Software Dependencies | No | The paper mentions using "Adam optimizer" and states that code is provided in the appendix (A.4) and they used "official open-source code of the models" (A.6), implying dependencies are handled there. However, it does not list specific software libraries or packages with their version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | A.5. Training and Data Details for Simulation Study: We train a 4-layer transformer model, with learnable modality encoding and sinusoidal position encoding, having a dimensionality of 100. Following conventions, we set the feed-forward hidden dimension to four times the embedding dimension (400). To prevent overfitting, we add dropouts in the embedding, attention, and penultimate layers. We use the Adam optimizer with a learning rate of 0.001 for 2000 epochs. The training batch size was 1024. [...] B.5. Experiment Details: Table 14 lists the hyperparameters and checkpoint details of the CLAVI finetuning experiments, including the epochs and learning rates for each model. A configuration sketch for the simulation setup is also given below the table. |
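
The quadrant-averaging probe quoted in the Pseudocode row lends itself to a compact implementation. Below is a minimal PyTorch sketch of row-wise averaging over selected quadrants of a post-softmax attention map, assuming a token layout of video tokens followed by text tokens; the function name `quag`, the quadrant labels, and the tensor shapes are illustrative assumptions for this report, not the authors' released code (see Appendix A.4 and the project page for that).

```python
import torch

def quag(attn: torch.Tensor, n_video: int, quadrants) -> torch.Tensor:
    """Row-wise average selected quadrants of a post-softmax attention map.

    attn      : (batch, heads, L, L) attention weights, where the first
                n_video positions are video tokens and the rest are text.
    quadrants : iterable of labels from {"VV", "VT", "TV", "TT"} naming the
                (query-modality, key-modality) blocks to impair.
    """
    out = attn.clone()
    L = attn.shape[-1]
    v, t = slice(0, n_video), slice(n_video, L)
    blocks = {"VV": (v, v), "VT": (v, t), "TV": (t, v), "TT": (t, t)}
    for name in quadrants:
        rows, cols = blocks[name]
        block = out[..., rows, cols]
        # Replace every entry of each row segment with that segment's mean,
        # erasing token-level structure while preserving the attention mass
        # assigned by each query to the block as a whole.
        out[..., rows, cols] = block.mean(dim=-1, keepdim=True).expand_as(block)
    return out

# Example: impair the cross-modal quadrants of a random attention map.
attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
impaired = quag(attn, n_video=10, quadrants=["VT", "TV"])
```

Under these assumptions, passing `["VT", "TV"]` would correspond to impairing cross-modal interactions and `["VV", "TT"]` to impairing within-modality interactions; the paper's exact quadrant sets and naming may differ.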
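
The simulation-study configuration quoted in the Experiment Setup row can likewise be sketched. This is a hypothetical re-creation of the described architecture (4 layers, embedding dimension 100, feed-forward dimension 400, learnable modality encoding, sinusoidal position encoding, dropout at the embedding, attention, and penultimate layers, Adam with learning rate 0.001); the head count, vocabulary size, sequence length, dropout rate, and two-class output head are assumptions not specified in the quoted text.

```python
import math
import torch
import torch.nn as nn

class SimulationTransformer(nn.Module):
    """4-layer encoder with embedding dim 100 and feed-forward dim 400,
    learnable modality encoding and sinusoidal position encoding (per A.5).
    Head count, vocab size, max length, dropout, and output size are
    illustrative assumptions."""

    def __init__(self, vocab_size=64, d_model=100, n_layers=4, n_heads=4,
                 d_ff=400, max_len=32, dropout=0.1, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.modality_enc = nn.Parameter(torch.zeros(2, d_model))  # learnable, one vector per modality
        self.register_buffer("pos_enc", self._sinusoidal(max_len, d_model))
        self.emb_dropout = nn.Dropout(dropout)  # dropout at the embedding layer
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, dropout,
                                           batch_first=True)  # attention dropout inside
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Sequential(nn.Dropout(dropout),  # penultimate-layer dropout
                                  nn.Linear(d_model, n_classes))

    @staticmethod
    def _sinusoidal(max_len, d_model):
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, tokens, modality_ids):
        # tokens: (B, L) token ids; modality_ids: (B, L) with values in {0, 1}
        x = (self.embed(tokens)
             + self.pos_enc[: tokens.size(1)]
             + self.modality_enc[modality_ids])
        x = self.encoder(self.emb_dropout(x))
        return self.head(x.mean(dim=1))  # mean-pooled prediction

model = SimulationTransformer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr from A.5
```

Training this model for 2000 epochs with batch size 1024, as stated in A.5, would then be a standard supervised loop over the model and optimizer above.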