Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Vision Function Layer in Multimodal LLMs

Authors: Cheng Shi, Yizhou Yu, Sibei Yang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We comprehensively test across a diverse range of vision functions and MLLM architectures, leading to the surprising discovery of a consistent internal MLLM mechanism. This mechanism proves broadly applicable, from early MLLM iterations like the LLaVA series [23, 29] to recent models such as the Qwen series [3, 62]. Our key findings are as follows: MLLMs feature Vision Function Layers, where specific visual functions are executed within remarkably narrow layer blocks (typically 2-3 layers). Experiment Setting. To precisely identify the layers for key visual functions within MLLMs, we employ our Vision Token Swapping methodology, which measures the change rate in the outputs after token swapping. We construct dedicated paired image datasets for four key visual functions: Optical Character Recognition (OCR), Object Counting (Count), Object Recognition (Recognition), and Object Grounding (Grounding), as exemplified in Fig. 1.
Researcher Affiliation Academia Cheng Shi (1,2), Yizhou Yu (2), Sibei Yang (1); (1) Sun Yat-sen University, (2) School of Computing and Data Science, The University of Hong Kong
Pseudocode No The paper contains mathematical equations like (1) and (2) describing the MLLM process, and (3), (4), (5) describing the Vision Token Swapping and Dropping methodologies. It does not feature any explicitly labeled "Pseudocode" or "Algorithm" blocks, nor does it present structured steps in a code-like format for any procedure.
Open Source Code No https://github.com/ChengShiest/Vision-Function-Layer and Answer: [No] Justification: Code will be released after acceptance.
Open Datasets Yes Counting pairs, adapted from the CLEVR dataset [19], differ primarily in the quantity of a target object type... Recognition pairs, drawn from COCO [27]... We test models such as LLaVA-v1.5 (7B, 13B) [23, 29] and Qwen2.5-VL (3B, 7B) [3, 64] across a suite of benchmarks including SQA-I [33], MMMU [68], POPE [22], SEED [24], CVBench [58], TextVQA [52], OCR [30], and ChartQA [35]... We utilize the SAT [46] as training dataset... specifically LLaVA-665k [23].
Dataset Splits Yes We utilize the SAT [46] as training dataset, specifically its single-image question-answering tasks probing spatial understanding. Our base architectures are the Qwen2.5-VL models [64]. We benchmark VFL-LoRA against two primary baselines: (1) Standard LoRA, where LoRA is applied uniformly across all adaptable layers, and (2) Reversed-VFL, an ablation study where LoRA is applied to layers excluding the count-function layer range. The evaluation is conducted on a comprehensive test set comprising both in-domain spatial reasoning tasks from CV-Bench (which includes sub-tasks like Count, Relation, Depth, and Distance) and a diverse suite of out-of-domain benchmarks (such as ChartQA [35], OCRBench [30], MMMU [68], and POPE [22]) to assess broader generalization. We construct a diverse data pool consisting of 20 million vision instruction samples... We compare data subset selection strategies Oracle, Random, Expert [58], and our VFL across sample sizes ranging from 150k to 665k. We conduct experiments focusing on its ability to identify high-utility data within a more constrained and established dataset, specifically LLaVA-665k [23]. The objective was to curate an optimal 20% subset from the LLaVA-665k dataset itself for fine-tuning.
Hardware Specification Yes We use 8 H20 GPUs, and the entire training process takes approximately 3 hours. For Qwen2.5-VL-7B, we apply LoRA fine-tuning to layers 10 through 17, as well as layers 20, 21, 22, and 23, while keeping all other parameters frozen. We perform the data classification process on 16 H20 GPUs, which takes approximately 40 hours to complete.
Software Dependencies Yes Therefore, we adopt two well-maintained toolkits: lmms-eval (v0.33) and VLMEvalKit (v0.2), to perform all our evaluations.
Experiment Setup Yes We set the LoRA rank to 32, the scaling factor (alpha) to 64, and the dropout rate to 0.05. For Qwen2.5-VL-7B, we apply LoRA fine-tuning to layers 10 through 17, as well as layers 20, 21, 22, and 23, while keeping all other parameters frozen. The entire training process takes approximately 3 hours.
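
The layer-restricted fine-tuning setup quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code (which has not been released): the module naming scheme (`model.layers.<i>...`), the helper names `is_adapted` and `lora_settings`, and the choice of dictionary keys (modeled on common LoRA configuration fields) are assumptions made for the example. Only the numeric values and the layer list come from the quoted text.

```python
import re

# Hyperparameters quoted in the Experiment Setup row.
LORA_RANK = 32
LORA_ALPHA = 64
LORA_DROPOUT = 0.05

# Layers 10 through 17, plus 20, 21, 22, and 23 (reported for Qwen2.5-VL-7B).
VFL_LAYERS = list(range(10, 18)) + [20, 21, 22, 23]

# Hypothetical module naming scheme, e.g. "model.layers.12.self_attn.q_proj".
# Real checkpoints may name their decoder layers differently.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def is_adapted(module_name, layers=VFL_LAYERS):
    """Return True if the module sits in a decoder layer selected for LoRA."""
    match = LAYER_PATTERN.search(module_name)
    return match is not None and int(match.group(1)) in layers


def lora_settings():
    """Collect the quoted hyperparameters in one dictionary (illustrative keys)."""
    return {
        "r": LORA_RANK,
        "lora_alpha": LORA_ALPHA,
        "lora_dropout": LORA_DROPOUT,
        "layers_to_transform": VFL_LAYERS,
    }


if __name__ == "__main__":
    settings = lora_settings()
    # Standard LoRA scales the low-rank update by alpha / r.
    print("scaling:", settings["lora_alpha"] / settings["r"])
    print("adapted layers:", settings["layers_to_transform"])
    print(is_adapted("model.layers.12.self_attn.q_proj"))
    print(is_adapted("model.layers.18.mlp.up_proj"))
```

With the quoted values, 12 decoder layers receive adapters, and under standard LoRA scaling the update is multiplied by alpha / rank = 64 / 32 = 2. All modules outside the listed layers would stay frozen.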