Understanding Information Storage and Transfer in Multi-Modal Large Language Models

Authors: Samyadeep Basu, Martin Grayson, Cecily Morrison, Besmira Nushi, Soheil Feizi, Daniela Massiceti

NeurIPS 2024 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use these tools to study two open-source MLLMs, LLaVA and multi-modal Phi-2. Our key findings show that these MLLMs rely on MLP and self-attention blocks in much earlier layers for information storage, compared to LLMs, whose mid-layer MLPs are more important. We also show that a consistent small subset of visual tokens output by the vision encoder is responsible for transferring information from the image to these causal blocks. We validate these mechanisms by introducing MULTEDIT, a model-editing algorithm that can correct errors and insert new long-tailed information into MLLMs by targeting these causal blocks. (A causal-tracing sketch illustrating this localization procedure follows the table.)
Researcher Affiliation | Collaboration | Samyadeep Basu (University of Maryland); Martin Grayson (Microsoft Research); Cecily Morrison (Microsoft Research); Besmira Nushi (Microsoft Research); Soheil Feizi (University of Maryland); Daniela Massiceti (Microsoft Research)
Pseudocode | No | The paper describes its algorithms and methods in prose but does not provide formal pseudocode or an algorithm block.
Open Source Code | No | We will provide the final cleaned code with the camera-ready version of the paper. However, for the time being, we have provided all the experimental details at a fine-grained level to reproduce our results.
Open Datasets | Yes | We also introduce VQA-Constraints, a new dataset of 9.7k factual questions annotated with constraints, spanning natural images (from OK-VQA [22], WikiMovies [37], and Known [12]).
Dataset Splits | No | The paper describes the VQA-Constraints dataset and its subsets (OK-VQA, Multimodal Movies, Multimodal Known), but it does not provide explicit training, validation, or test splits with percentages or counts for the experiments on these datasets, beyond mentioning the use of the OK-VQA test set in Appendix B.
Hardware Specification | Yes | All our experiments are performed on Nvidia A6000 and A5000 GPUs.
Software Dependencies | No | The paper mentions the use of GPT-4 for annotations and implies PyTorch as the deep learning framework, but it does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | Hyperparameters: we use a learning rate of 0.1 to optimize for the values with the Adam optimizer. For the regularization factor λ, we use 0.01 after a grid search. Among the early causal layers 0-5, we find that editing layer 2 gives the best editing efficacy. (A sketch of this value-optimization step appears after the causal-tracing example below.)
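
The early-layer localization reported in the Research Type row comes from a causal-tracing style analysis (corrupt the input, then restore clean hidden states block by block). The sketch below is a minimal reconstruction of that procedure, assuming an MLLM exposed as a Hugging Face style model whose decoder sub-blocks (MLP or self-attention modules) can be passed in as a list; the hook mechanics, the function name `causal_trace`, and the normalized indirect-effect score are our assumptions rather than the authors' released code.

```python
import torch


def _hidden(out):
    # HF-style decoder blocks often return tuples; the first element is the hidden states.
    return out[0] if isinstance(out, tuple) else out


@torch.no_grad()
def causal_trace(model, layers, clean_inputs, corrupt_inputs, answer_id, token_pos):
    """Score each block by how much restoring its clean hidden state at one token
    position recovers the probability of the correct answer token `answer_id`."""
    # 1. Clean run: cache every block's output hidden states.
    cache, handles = {}, []
    for i, layer in enumerate(layers):
        def save(mod, inp, out, i=i):
            cache[i] = _hidden(out).detach().clone()
        handles.append(layer.register_forward_hook(save))
    p_clean = model(**clean_inputs).logits[0, -1].softmax(-1)[answer_id].item()
    for h in handles:
        h.remove()

    # 2. Corrupted run (e.g. Gaussian noise added to the image/constraint token embeddings).
    p_corrupt = model(**corrupt_inputs).logits[0, -1].softmax(-1)[answer_id].item()

    # 3. Patched runs: restore one block's clean state at token_pos at a time.
    effects = []
    for i, layer in enumerate(layers):
        def patch(mod, inp, out, i=i):
            _hidden(out)[:, token_pos] = cache[i][:, token_pos]
            return out
        h = layer.register_forward_hook(patch)
        p_patched = model(**corrupt_inputs).logits[0, -1].softmax(-1)[answer_id].item()
        h.remove()
        # Normalized indirect effect: share of the clean-vs-corrupt gap this block restores.
        effects.append((p_patched - p_corrupt) / max(p_clean - p_corrupt, 1e-8))
    return effects
```

Blocks with the largest effect scores are the candidate causal blocks; per the paper's findings, for LLaVA and multi-modal Phi-2 these concentrate in much earlier layers than in text-only LLMs.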
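
The hyperparameters in the Experiment Setup row (Adam, learning rate 0.1, λ = 0.01, editing layer 2) correspond to optimizing a replacement value vector at the targeted causal block. The sketch below shows one plausible form of that optimization loop under those reported settings; the function name `optimize_edit_value`, the step count, and the exact loss form are illustrative assumptions, and the weight update that MULTEDIT would subsequently write into the block is not shown.

```python
import torch
import torch.nn.functional as F


def _hidden(out):
    # Same helper as in the causal-tracing sketch above.
    return out[0] if isinstance(out, tuple) else out


def optimize_edit_value(model, layer, inputs, target_id, token_pos,
                        lr=0.1, lam=0.01, steps=25):
    """Optimize a replacement output vector for one early causal block (the paper
    reports layer 2 working best) so the model emits the corrected answer token."""
    # Capture the block's original output vector at token_pos on this prompt.
    captured = {}

    def save(mod, inp, out):
        captured["v"] = _hidden(out)[:, token_pos].detach().clone()

    h = layer.register_forward_hook(save)
    with torch.no_grad():
        model(**inputs)
    h.remove()

    v_orig = captured["v"]
    v = v_orig.clone().requires_grad_(True)
    opt = torch.optim.Adam([v], lr=lr)  # reported learning rate 0.1

    # During optimization, overwrite the block's output at token_pos with v.
    def patch(mod, inp, out):
        hidden = _hidden(out).clone()
        hidden[:, token_pos] = v
        return (hidden,) + out[1:] if isinstance(out, tuple) else hidden

    handle = layer.register_forward_hook(patch)
    target = torch.tensor([target_id], device=next(model.parameters()).device)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(**inputs).logits
        # Negative log-likelihood of the corrected answer at the final position,
        # plus an L2 penalty (weight lam = 0.01) keeping v near the original value.
        loss = F.cross_entropy(logits[:, -1], target) + lam * (v - v_orig).pow(2).sum()
        loss.backward()
        opt.step()
    handle.remove()
    return v.detach()
```

In a full editing pipeline the optimized vector would then be folded into the block's weights so the correction persists without runtime hooks; freezing the model's parameters before the loop avoids accumulating unnecessary gradients.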