Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning.
Researcher Affiliation | Collaboration | 1 Area Science Park, Italy; 2 Sapienza University of Rome, Italy; 3 Institute of Science and Technology, Austria
Pseudocode | Yes | Algorithm 1: Simultaneous Orthogonal Matching Pursuit (SOMP)
Open Source Code | Yes | Our code is available at https://github.com/lorenzobasile/HeadPursuit.
Open Datasets | Yes | We consider a generative LLM, Mistral-7B [37], and evaluate it on textual prompts from the TriviaQA [36] question answering dataset. We consider two datasets, RealToxicityPrompts (RTP) [38], which contains naturally occurring Web prompts, and Thoroughly Engineered Toxicity (TET) [39], a benchmark with carefully constructed test cases... For this experiment, we benchmark LLaVA-NeXT-7B [4] (from now on just LLaVA for short) on a range of image classification datasets, including: MNIST [42], SVHN [43], GTSRB [44], EuroSAT [45], RESISC45 [46] and DTD [47]. We consider the Flickr30k dataset [50], and evaluate the possibility of promoting or reducing the presence of words belonging to specific semantic areas in the captions generated by LLaVA.
Dataset Splits | Yes | country names account for over 6% of the answers in the test split. For these purposes, to ensure a fair comparison, we explicitly avoid selecting heads previously identified as specialized when constructing the control set. In all experiments, we report such random control results over 10 independently sampled sets of heads. Importantly, head representations are computed using questions from the training data, which is strictly disjoint from the data used in evaluation.
Hardware Specification | Yes | All the experiments were executed on a single NVIDIA H100 GPU equipped with 80GB VRAM.
Software Dependencies | No | To perform our experiments we employed pre-trained model checkpoints implemented in the Hugging Face transformers library [52]. Detailed information on such resources is provided in Appendix C. All the experiments were executed on a single NVIDIA H100 GPU equipped with 80GB VRAM.
Experiment Setup | Yes | Our goal is to identify attention heads specialized in generating country names, a target attribute motivated by their relative abundance in the dataset: despite their specificity, country names account for over 6% of the answers in the test split. For targeting the countries concept, we restrict the tokens in our dictionary (unembedding matrix) to those corresponding to names of countries, and apply our Matching Pursuit-based method to select specialized heads. As an additional baseline, we report results obtained by inverting heads selected with a simple adaptation of the Logit Lens. Specifically, we score each head with the mean logit assigned to country-related tokens by LL, and select top-k heads as in our method. In this experiment, we consider two opposite intervention directions: one inhibitory, as in previous experiments, and one enhancing. In the former case, the objective is to make the model produce meaningful captions that do not contain the target property (e.g., colors), while in the latter the aim becomes to enhance the target property, while preserving the model's capabilities in generating meaningful descriptions. The two setups reduce to rescaling selected heads by a coefficient that is α = -1 in the negative case and α > 1 in the positive case. Heads are selected using SOMP on a dictionary of tokens corresponding to lists of keywords regarding colors, sentiments and quantity, with the first two adopted from [27] and the latter manually curated. We evaluate various choices of α in the Appendix in Figure 10, and choose α = 5 for our experiments as it guarantees a reasonable trade-off between caption quality and attribute enhancement.
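
For readers unfamiliar with the two operations named in the table (SOMP and rescaling selected heads by a coefficient α), the following is a minimal illustrative sketch, not the authors' implementation. It shows a generic textbook variant of Simultaneous Orthogonal Matching Pursuit that scores atoms by the sum of absolute correlations with all residuals, plus a plain rescaling helper. The function names (`dot`, `somp`, `rescale_heads`), the list-of-lists data layout, and the pure-Python style are assumptions made for self-containment; how the paper maps head representations and unembedding rows onto signals and atoms is specified in the paper's Algorithm 1, not here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def somp(signals, atoms, n_select):
    """Generic SOMP sketch: pick atoms shared by all signals.

    signals: list of vectors assumed to share one sparse support.
    atoms: list of nonzero dictionary vectors.
    Returns (indices of selected atoms, final residuals).
    """
    residuals = [list(s) for s in signals]
    basis = []      # orthonormal basis spanning the selected atoms
    selected = []
    for _ in range(n_select):
        # Score every unused atom by its total absolute correlation
        # with the current residuals, normalized by the atom norm.
        best, best_score = None, 0.0
        for j, atom in enumerate(atoms):
            if j in selected:
                continue
            norm = math.sqrt(dot(atom, atom))
            score = sum(abs(dot(r, atom)) for r in residuals) / norm
            if score > best_score + 1e-12:
                best, best_score = j, score
        if best is None:  # residuals are numerically zero
            break
        # Gram-Schmidt: orthogonalize the new atom against the basis.
        q = list(atoms[best])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        q_norm = math.sqrt(dot(q, q))
        if q_norm < 1e-12:  # atom already lies in the selected span
            break
        q = [qi / q_norm for qi in q]
        basis.append(q)
        selected.append(best)
        # Orthogonal step: remove the new direction from every residual.
        residuals = [
            [ri - dot(r, q) * qi for ri, qi in zip(r, q)] for r in residuals
        ]
    return selected, residuals


def rescale_heads(head_outputs, head_ids, alpha):
    """Multiply the outputs of the chosen heads by alpha; copy the rest."""
    return [
        [alpha * x for x in out] if h in head_ids else list(out)
        for h, out in enumerate(head_outputs)
    ]


# Toy example: two signals supported on atoms 0 and 2 of the standard basis.
toy_atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_signals = [[3.0, 0.0, 1.0], [2.0, 0.0, -1.0]]
print(somp(toy_signals, toy_atoms, 2)[0])  # -> [0, 2]
```

In this sketch, enhancement with the value reported in the table would correspond to a call such as `rescale_heads(outputs, {3, 7}, 5.0)`, where the head indices are purely hypothetical.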