Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Caption This, Reason That: VLMs Caught in the Middle

Authors: Zihan Weng, Lucas Gomez, Taylor Webb, Pouya Bashivan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs moderately improves core cognitive abilities.
Researcher Affiliation	Collaboration	Zihan Weng Integrated Program in Neuroscience (IPN) McGill University Mila, University of Montreal Canada EMAIL Lucas Gomez Integrated Program in Neuroscience (IPN) McGill University Mila, University of Montreal Canada EMAIL https://www.lucasgomez.ca/ Taylor Whittington Webb Microsoft Research USA EMAIL Pouya Bashivan Department of Physiology McGill University Mila, University of Montreal Canada EMAIL
Pseudocode	No	The paper includes 'Prompt & Script Examples' in Appendix A.11, which contain actual Python code snippets for self-captioning and evaluation. However, it does not include structured pseudocode or algorithm blocks that describe an algorithm in an abstract, language-agnostic format typically referred to as 'pseudocode'.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The paper does not provide open access to the code or data. However, it includes sufficiently detailed methodological descriptions in the Methods section and the appendix.
Open Datasets	Yes	We utilize the iWISDM task environment [22] to generate all cognitive tasks and fine-tuning data in this study. This environment enables the procedural generation of an effectively limitless number of vision-language decision-making tasks. ... We used ShapeNet objects [54], which include images of 3D-rendered everyday objects taken at various viewing angles. ... Furthermore, models PAM and CVR accuracies are significantly correlated to their performance on widely used benchmarks such as MMMU-Pro (Figures A.7.1), which further validates the effectiveness of the evaluations presented here. ... To further assess generalization, we benchmark the LoRA fine-tuned models along with the base model on MMBench, MMMU-Pro and VQAv2.
Dataset Splits	Yes	To achieve this, we used iWISDM [22] to generate task sets of varying sizes: 100, 1,000, and 10,000 tasks. For each task, 10 trials were generated, resulting in training sets comprising 1,000, 10,000, and 100,000 trials, respectively. ... It is also important to note that there is no overlap between the training and CVR evaluation datasets.
Hardware Specification	Yes	All fine-tuning experiments were performed on 4 NVIDIA A5000 GPUs.
Software Dependencies	No	All variations of Qwen2.5-VL and LLaVA-OneVision were hosted with llama-factory [58] and evaluated with the OpenAI-style API. The InternVL-2.5 and MiniCPM-V 2.6 were deployed using Hugging Face Transformers. The paper mentions the software used (llama-factory, OpenAI-style API, Hugging Face Transformers) but does not provide specific version numbers for these dependencies, which is required for a reproducible description.
Experiment Setup	Yes	Table A.3.4: Overview of the hyperparameters for Qwen2.5-VL-7B-Instruct LoRA Fine-tuning. N_tasks: 100 / 1000 / 10000; N_trials: 1000 / 10000 / 100000; N_epochs: 10; Batch_size: 1; Gradient_accum: 32; Scheduler: Cosine; Peak_LR: 4e-05; Warmup: 1 / 7 / 70; Mixed_precision: bf16; Optimizer: AdamW(0.01); LoRA_rank: 8; LoRA_alpha: 16; LoRA_dropout: 0.2; LoRA_targets: vision projector, LLM
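
The Dataset Splits and Experiment Setup rows can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not the authors' code: the class `FineTuneRun`, its field names, and the derived property `samples_per_update_per_device` are hypothetical names introduced here. It transcribes the reported values and confirms that 10 trials per task yields the stated training set sizes. How the 4 GPUs were used (for example, data parallelism) is not stated in the rows above, so the sketch reports only the per-device figure.

```python
from dataclasses import dataclass

TRIALS_PER_TASK = 10  # "For each task, 10 trials were generated"


@dataclass(frozen=True)
class FineTuneRun:
    """One column of Table A.3.4 (hypothetical container, values transcribed)."""

    n_tasks: int
    warmup: int  # unit (steps vs. epochs) is not stated in the table
    n_epochs: int = 10
    batch_size: int = 1
    gradient_accum: int = 32
    peak_lr: float = 4e-05
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.2

    @property
    def n_trials(self) -> int:
        # Training set size implied by the Dataset Splits row.
        return self.n_tasks * TRIALS_PER_TASK

    @property
    def samples_per_update_per_device(self) -> int:
        # Batch size times gradient accumulation, on a single device.
        return self.batch_size * self.gradient_accum


# The three reported configurations (N_tasks paired with Warmup).
RUNS = [FineTuneRun(100, 1), FineTuneRun(1_000, 7), FineTuneRun(10_000, 70)]

for run in RUNS:
    print(run.n_tasks, run.n_trials, run.samples_per_update_per_device)
```

The computed trial counts match the N_trials entries reported in Table A.3.4, which suggests the two rows describe the same three training sets.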