Understanding Information Storage and Transfer in Multi-Modal Large Language Models
Authors: Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, Daniela Massiceti
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use these tools to study two open-source MLLMs, LLaVA and multi-modal Phi-2. Our key findings show that these MLLMs rely on MLP and self-attention blocks in much earlier layers for information storage, compared to LLMs whose mid-layer MLPs are more important. We also show that a consistent small subset of visual tokens output by the vision encoder are responsible for transferring information from the image to these causal blocks. We validate these mechanisms by introducing MULTEDIT, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs by targeting these causal blocks. |
| Researcher Affiliation | Collaboration | Samyadeep Basu (University of Maryland); Martin Grayson (Microsoft Research); Cecily Morrison (Microsoft Research); Besmira Nushi (Microsoft Research); Soheil Feizi (University of Maryland); Daniela Massiceti (Microsoft Research) |
| Pseudocode | No | The paper describes algorithms and methods but does not provide a formal pseudocode or algorithm block. |
| Open Source Code | No | We will provide the final cleaned code with the camera-ready version of the paper. However, for the time being, we have provided all the experimental details in fine-grained detail to reproduce our results. |
| Open Datasets | Yes | We also introduce VQA-Constraints, a new dataset of 9.7k factual questions annotated with constraints, spanning natural images (from OK-VQA [22], WikiMovies [37], and Known [12]). |
| Dataset Splits | No | The paper describes the VQA-Constraints dataset and its subsets (OK-VQA, Multimodal Movies, Multimodal Known), but it does not provide explicit training, validation, or testing splits with percentages or counts for the experiments conducted on these datasets, other than mentioning the use of the 'test-set of OK-VQA' in Appendix B. |
| Hardware Specification | Yes | All our experiments are performed on Nvidia-A6000 and A5000 GPUs. |
| Software Dependencies | No | The paper mentions the use of GPT-4 for annotations and implicitly PyTorch as a deep learning framework, but it does not specify version numbers for any key software components or libraries. |
| Experiment Setup | Yes | Hyperparameters. We use a learning rate of 0.1 to optimize for the values using the Adam optimizer. For the regularization factor λ, we use 0.01 after a grid search. Amongst the set of early causal layers between 0-5, we find editing Layer 2 to result in the best editing efficacy. (A minimal sketch of this value-optimization step is given below the table.) |
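
The hyperparameters quoted in the Experiment Setup row describe an optimization over value vectors at an early MLP layer. Below is a minimal sketch of that style of value optimization; it is not the authors' MULTEDIT implementation. GPT-2 is used as a small, text-only stand-in for the MLLM's language model, and the prompt, target answer, hook placement, and step count are illustrative assumptions. Only the learning rate (0.1), the Adam optimizer, the regularization factor λ = 0.01, and the choice of edit layer 2 come from the paper.

```python
# Hedged sketch of value optimization at an early MLP layer (not MULTEDIT itself).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

EDIT_LAYER = 2                   # early causal layer the paper reports editing best
LR, LAM, STEPS = 0.1, 0.01, 25   # lr and lambda from the paper; step count is assumed

tok = AutoTokenizer.from_pretrained("gpt2")             # text-only stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)                             # only the value offset is trained

prompt = "The Eiffel Tower is located in"               # illustrative edit request
target = " Rome"
prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids
full_ids = torch.cat([prompt_ids, target_ids], dim=1)

delta = torch.zeros(model.config.n_embd, requires_grad=True)  # learnable value offset
opt = torch.optim.Adam([delta], lr=LR)

def add_delta(module, inputs, output):
    # Add the learnable offset to the MLP output at the last prompt token.
    out = output.clone()
    out[:, prompt_ids.shape[1] - 1, :] += delta
    return out

hook = model.transformer.h[EDIT_LAYER].mlp.register_forward_hook(add_delta)

for _ in range(STEPS):
    opt.zero_grad()
    logits = model(full_ids).logits
    # Logits at positions prompt_len-1 ... end-1 predict the target tokens.
    tgt_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    nll = F.cross_entropy(tgt_logits.reshape(-1, tgt_logits.size(-1)),
                          target_ids.reshape(-1))
    loss = nll + LAM * delta.norm() ** 2                # NLL + lambda * ||delta||^2
    loss.backward()
    opt.step()

hook.remove()
# `delta` approximates an optimized value vector; a MULTEDIT-style edit would fold
# it into the layer's MLP weights rather than keep a runtime hook.
```

This sketch stops at the optimization step; the paper's method additionally localizes the causal layer via multimodal causal tracing and writes the optimized value into the model's weights.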