Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Caption This, Reason That: VLMs Caught in the Middle

Authors: Zihan Weng, Lucas Gomez, Taylor Webb, Pouya Bashivan

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs moderately improves core cognitive abilities.
Researcher Affiliation Collaboration Zihan Weng Integrated Program in Neuroscience (IPN) McGill University Mila, University of Montreal Canada EMAIL Lucas Gomez Integrated Program in Neuroscience (IPN) McGill University Mila, University of Montreal Canada EMAIL https://www.lucasgomez.ca/ Taylor Whittington Webb Microsoft Research USA EMAIL Pouya Bashivan Department of Physiology McGill University Mila, University of Montreal Canada EMAIL
Pseudocode No The paper includes 'Prompt & Script Examples' in Appendix A.11, which contain actual Python code snippets for self-captioning and evaluation. However, it does not include structured pseudocode or algorithm blocks that describe an algorithm in an abstract, language-agnostic format typically referred to as 'pseudocode'.
Open Source Code No Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The paper does not provide open access to the code or data. However, it includes sufficiently detailed methodological descriptions in the Methods section and the appendix.
Open Datasets Yes We utilize the iWISDM task environment [22] to generate all cognitive tasks and fine-tuning data in this study. This environment enables the procedural generation of an effectively limitless number of vision-language decision-making tasks. ... We used ShapeNet objects [54], which include images of 3D-rendered everyday objects taken at various viewing angles. ... Furthermore, models PAM and CVR accuracies are significantly correlated to their performance on widely used benchmarks such as MMMU-Pro (Figures A.7.1), which further validates the effectiveness of the evaluations presented here. ... To further assess generalization, we benchmark the LoRA fine-tuned models along with the base model on MMBench, MMMU-Pro and VQAv2.
Dataset Splits Yes To achieve this, we used iWISDM [22] to generate task sets of varying sizes: 100, 1,000, and 10,000 tasks. For each task, 10 trials were generated, resulting in training sets comprising 1,000, 10,000, and 100,000 trials, respectively. ... It is also important to note that there is no overlap between the training and CVR evaluation datasets.
Hardware Specification Yes All fine-tuning experiments were performed on 4 NVIDIA A5000 GPUs.
Software Dependencies No All variations of Qwen2.5-VL and LLaVA-OneVision were hosted with llama-factory [58] and evaluated with the OpenAI-style API. The InternVL-2.5 and MiniCPM-V 2.6 were deployed using Hugging Face Transformers. The paper mentions the software used (llama-factory, OpenAI-style API, Hugging Face Transformers) but does not provide specific version numbers for these dependencies, which is required for a reproducible description.
Experiment Setup Yes Table A.3.4: Overview of the hyperparameters for Qwen2.5-VL-7B-Instruct LoRA Fine-tuning. N_tasks: 100 / 1000 / 10000; N_trials: 1000 / 10000 / 100000; N_epochs: 10; Batch_size: 1; Gradient_accum: 32; Scheduler: Cosine; Peak_LR: 4e-05; Warmup: 1 / 7 / 70; Mixed_precision: bf16; Optimizer: AdamW(0.01); LoRA_rank: 8; LoRA_alpha: 16; LoRA_dropout: 0.2; LoRA_targets: vision projector, LLM
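
The training-set sizes in the Dataset Splits row and the schedule settings in the Experiment Setup row can be tied together with a little arithmetic. The sketch below is illustrative only and is not the authors' code. The numeric constants come from the rows above. Everything else is an assumption: the helper names, the linear warmup shape, decay to zero, treating the warmup values as optimizer steps, and the `num_devices` parameter (the report says 4 GPUs were used but does not say how work was split across them).

```python
import math

# Values quoted in the rows above (Dataset Splits and Table A.3.4).
TRIALS_PER_TASK = 10
PEAK_LR = 4e-5
BATCH_SIZE = 1
GRAD_ACCUM = 32
N_EPOCHS = 10
WARMUP_BY_N_TASKS = {100: 1, 1_000: 7, 10_000: 70}


def n_trials(n_tasks: int) -> int:
    """Training trials for a task set: 10 trials are generated per task."""
    return n_tasks * TRIALS_PER_TASK


def samples_per_update(num_devices: int = 1) -> int:
    """Samples consumed per optimizer update.

    num_devices is a hypothetical knob; the parallelism scheme is not
    stated in the report.
    """
    return BATCH_SIZE * GRAD_ACCUM * num_devices


def total_updates(n_tasks: int, num_devices: int = 1) -> int:
    """Optimizer updates over all epochs, under the assumptions above."""
    per_epoch = math.ceil(n_trials(n_tasks) / samples_per_update(num_devices))
    return per_epoch * N_EPOCHS


def cosine_lr(step: int, total_steps: int, warmup_steps: int,
              peak_lr: float = PEAK_LR) -> float:
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    for tasks, warmup in WARMUP_BY_N_TASKS.items():
        steps = total_updates(tasks)
        midpoint_lr = cosine_lr(steps // 2, steps, warmup)
        print(tasks, n_trials(tasks), steps, midpoint_lr)
```

Calling `total_updates(tasks, num_devices=4)` shows how the step count would change if the four GPUs were used in a data-parallel setup, which is one plausible reading of the hardware row but is not confirmed by the report.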