Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs

Authors: Mustafa Shukor, Matthieu Cord

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representations, aiming to understand their generalization beyond textual inputs. Our work provides the following findings. (1) Perceptual tokens are easily distinguishable from textual ones inside LLMs, with significantly different representations (e.g., they live in different narrow cones), and a complete translation to textual tokens does not exist. Yet, (2) both perceptual and textual tokens activate similar LLM weights. (3) Despite being different, perceptual tokens are implicitly aligned to textual tokens inside LLMs; we call this the implicit multimodal alignment effect (IMA) and argue that it is linked to architectural design, helping LLMs to generalize. This provides more evidence that the generalization of LLMs to multimodal inputs is mainly due to their architecture. These findings lead to several implications. (1) We find a positive correlation between the implicit alignment score and task performance, suggesting that it could act as a proxy metric for model evaluation and selection (a sketch of such a score appears after the table).
Researcher Affiliation | Collaboration | Mustafa Shukor (1), Matthieu Cord (1,2); (1) Sorbonne University, (2) Valeo.ai
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available here: https://github.com/mshukor/ima-lmms.
Open Datasets | Yes | ST setup. We consider a wide range of public multimodal datasets that cover 2 representative tasks: captioning and question-answering (QA) across image (VQAv2 [105], GQA [106], OKVQA [107], COCO caption [108]), video (MSVD, MSRVTT-QA [109], MSRVTT [110]), audio (Audiocaps [111], Clotho [112], Clotho-AQA [113]) and language tasks. For QA datasets we report accuracy (in an open-ended generation setup with exact match; see the exact-match sketch after the table), and for captioning we report the CIDEr metric. MT setup. We also evaluate the MT setup on recent datasets such as SEED [114], TextVQA [115] and POPE [50].
Dataset Splits | Yes | We train with a total batch size of 16 for captioning and 64 for VQA datasets. The number of epochs is set to 20 to ensure that all models converged, though most of these models converge after only a couple of epochs. We select the best checkpoint for evaluation. For example, the model for image captioning converged after 4 epochs. In Table 2, column headers include 'Acc (Val)' for VQAv2, GQA and OKVQA.
Hardware Specification | Yes | All models are trained on 8 V100 GPUs, and the training time depends on the task, e.g., for the large VQAv2 dataset each epoch takes 30 minutes; other, smaller datasets take less time, e.g., 10 minutes for Audiocaps and MSVD-QA.
Software Dependencies | No | The paper mentions the AdamW optimizer and various models (e.g., Vicuna-v1.5, LLaVA-1.5, ViT, TimeSformer, AST, CLIP, MAE) but does not specify software versions for the programming languages, frameworks, or libraries used for implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | To train these baselines, we use the AdamW optimizer with a learning rate of 2e-4 that decreases with a cosine annealing scheduler to a minimum of 1e-5. We train with a total batch size of 16 for captioning and 64 for VQA datasets. The number of epochs is set to 20 to ensure that all models converged, though most of these models converge after only a couple of epochs. We select the best checkpoint for evaluation. (See the optimizer sketch after the table.)
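
The implicit alignment score referenced in the Research Type row is not spelled out in this report. The following is a minimal sketch, assuming the score is approximated as the average cosine similarity between perceptual-token and textual-token hidden states taken from the same LLM layer; the function names and tensor shapes are assumptions, and the paper's exact metric may differ.

# Hypothetical sketch of a per-layer implicit alignment score, approximated here
# as the mean cosine similarity between perceptual and textual hidden states.
# The paper's exact metric may differ; names and shapes below are assumptions.
import torch
import torch.nn.functional as F

def implicit_alignment_score(perceptual_hidden: torch.Tensor, textual_hidden: torch.Tensor) -> float:
    # Both tensors have shape (num_tokens, hidden_dim) and come from the same LLM layer.
    p = F.normalize(perceptual_hidden, dim=-1)  # unit-norm perceptual token states
    t = F.normalize(textual_hidden, dim=-1)     # unit-norm textual token states
    # Cosine similarity between every perceptual/textual token pair, then averaged.
    return (p @ t.T).mean().item()

def average_alignment(per_layer_perceptual, per_layer_textual) -> float:
    # Average the per-layer scores over all LLM layers.
    scores = [implicit_alignment_score(p, t)
              for p, t in zip(per_layer_perceptual, per_layer_textual)]
    return sum(scores) / len(scores)

Under the paper's reported finding, a higher average score would be expected to correlate positively with downstream task performance, which is what makes it usable as a proxy metric for model selection.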
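
The Open Datasets row reports QA accuracy with exact match in an open-ended generation setup. Below is a minimal sketch of such a metric; the normalization step (lowercasing, whitespace and punctuation stripping) is an assumption, since each benchmark ships its own official evaluation script.

# Minimal sketch of open-ended exact-match accuracy for the QA datasets.
# The normalization is an assumption, not the benchmarks' official scorer.
import string

def normalize(text: str) -> str:
    # Lowercase, strip surrounding whitespace, drop punctuation.
    text = text.lower().strip()
    return text.translate(str.maketrans("", "", string.punctuation))

def exact_match_accuracy(predictions, references) -> float:
    # predictions and references are equal-length lists of answer strings.
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Example: "blue" matches "Blue" after normalization, "A dog." does not match "dog".
print(exact_match_accuracy(["A dog.", "blue"], ["dog", "Blue"]))  # 0.5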
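
The Experiment Setup row fully specifies the optimizer schedule, so a short configuration sketch can be given. This is a minimal PyTorch sketch, assuming the cosine schedule is stepped per iteration over 20 epochs; model and steps_per_epoch are placeholders, not names from the paper's codebase.

# Minimal sketch of the reported optimization setup: AdamW at lr 2e-4,
# decayed by cosine annealing to a minimum of 1e-5 over 20 epochs.
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch: int, epochs: int = 20):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch, eta_min=1e-5
    )
    return optimizer, scheduler

# Total batch sizes from the report: 16 for captioning datasets, 64 for VQA datasets.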