Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
The Narrow Gate: Localized Image-Text Communication in Native Multimodal Models
Authors: Alessandro Serra, Francesco Ortu, Emanuele Panizon, Lucrezia Valeriani, Lorenzo Basile, Alessio Ansuini, Diego Doimo, Alberto Cazzaniga
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study investigates how vision-language models (VLMs) handle image-understanding tasks, focusing on how visual information is processed and transferred to the textual domain. We compare native multimodal VLMs... and non-native multimodal VLMs... We show that ablating this single token significantly deteriorates image-understanding performance, whereas targeted, token-level interventions reliably steer image semantics and downstream text with fine-grained control. We study visual understanding tasks where the model is asked to answer questions from VQAv2 [37], generate image captions for Flickr30k [38] and MS-COCO [39], and complete simple prompts about ImageNet images [40]. Results. In native multimodal models, ablating the [EOI] token causes a substantial performance collapse across all tasks. |
| Researcher Affiliation | Collaboration | 1 Area Science Park, Trieste, Italy 2 SISSA, Trieste, Italy 3 University of Trieste, Trieste, Italy |
| Pseudocode | No | The paper describes analytical tools and methods in prose (e.g., "Analyzing information flow in VLMs", "Quantifying cross-modal attention", "Blocking Cross-Modal communication with Attention Knockout") and does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code needed to reproduce the experiments is available at: ritareasciencepark.github.io/Narrowgate. |
| Open Datasets | Yes | We study visual understanding tasks where the model is asked to answer questions from VQAv2 [37], generate image captions for Flickr30k [38] and MS-COCO [39], and complete simple prompts about ImageNet images [40]. |
| Dataset Splits | Yes | Ablation setup. For each model, we randomly sample 2,000 examples per dataset and measure baseline performance before applying targeted attention knockouts. Emu3-Gen finetuning. We fine-tuned Emu3 on a mixture of datasets using 37.5k samples from the VQAv2 [37] and MS-COCO-2014 [39] training sets and 150k samples from the LLaVA-instruct-150K using the Hugging Face implementation. Removing the narrow gate with fine-tuning. We fine-tuned Chameleon-7b and Emu3 on a dataset of 15k samples, 7.5k samples from the LLaVA-instruct-150K training set, and 7.5k from the VQAv2 training set. |
| Hardware Specification | Yes | We run all the experiments on a single NVIDIA A100 GPU with 40GB VRAM. For the Chameleon 34B, we used two 40GB GPUs. |
| Software Dependencies | No | We used the Hugging Face implementations of Chameleon [46, 47], Emu3-Gen [48], LLaVA [49], Pixtral [50], VILA-U [31] and Janus [51] models. While specific implementations are mentioned, explicit version numbers for the underlying software stack (e.g., Python, PyTorch, CUDA) are not provided. |
| Experiment Setup | Yes | Emu3-Gen finetuning. We fine-tuned Emu3 for one epoch with a batch size of 64 using LoRA with a rank of 64, an alpha of 16, Adam optimizer without weight decay, and cosine annealing scheduler starting with a learning rate of 5e-5. Removing the narrow gate with fine-tuning. We fine-tuned Chameleon-7b and Emu3 on a dataset of 15k samples, 7.5k samples from the LLaVA-instruct-150K training set, and 7.5k from the VQAv2 training set. We fine-tuned the models for one epoch with a batch size of 128, using LoRA with a rank of 128, alpha of 256, Adam optimizer with weight decay 0.1, cosine annealing learning scheduler with a starting learning rate of 2e-5. |
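
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object to make the implied step count and learning-rate curve concrete. The following is a minimal sketch, not the authors' training code: the names `FinetuneConfig`, `cosine_lr`, and `narrow_gate_cfg` are hypothetical, and it assumes a per-step schedule with no warmup, a final learning rate of 0, and that the last partial batch is kept. None of these details are stated in the quoted text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are illustrative)."""
    num_samples: int
    batch_size: int
    lora_rank: int
    lora_alpha: int
    weight_decay: float
    base_lr: float
    epochs: int = 1

    @property
    def total_steps(self) -> int:
        # Assumption: the last partial batch is kept, so round up.
        return math.ceil(self.num_samples * self.epochs / self.batch_size)


def cosine_lr(step: int, cfg: FinetuneConfig, min_lr: float = 0.0) -> float:
    """Cosine annealing from cfg.base_lr to min_lr (min_lr=0 and no warmup are assumptions)."""
    progress = min(step, cfg.total_steps) / cfg.total_steps
    return min_lr + 0.5 * (cfg.base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# "Removing the narrow gate" run: 15k samples, batch size 128,
# LoRA rank 128, alpha 256, weight decay 0.1, starting learning rate 2e-5.
narrow_gate_cfg = FinetuneConfig(
    num_samples=15_000,
    batch_size=128,
    lora_rank=128,
    lora_alpha=256,
    weight_decay=0.1,
    base_lr=2e-5,
)

if __name__ == "__main__":
    print("optimizer steps in one epoch:", narrow_gate_cfg.total_steps)
    for step in (0, narrow_gate_cfg.total_steps // 2, narrow_gate_cfg.total_steps):
        print(step, cosine_lr(step, narrow_gate_cfg))
```

Under these assumptions, one epoch over 15k samples at batch size 128 is a short run of roughly a hundred optimizer steps, so the learning rate decays over the whole of training rather than plateauing.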