Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs

Authors: Shmuel Berman, Jia Deng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning... We conduct a comprehensive evaluation of leading VLMs (including GPT-5, GEMINI 2.5 PRO, and CLAUDE SONNET 4) demonstrating that even flagship models lag far behind humans on trivial visual reasoning tasks, despite advances in primitive vision.
Researcher Affiliation | Academia | Shmuel Berman Jia Deng Princeton University, Department of Computer Science EMAIL
Pseudocode | No | The paper describes the tasks and methodologies conceptually and provides examples, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our generator, evaluation sets, and evaluation code are available here.
Open Datasets | Yes | We introduce a procedurally-generated evaluation set comprising three task categories designed to be trivial for humans and require minimal prior knowledge. We present three task categories: Object Re-Identification, Visual Scavenger Hunt, and Circuit Connections. ... Our generator, evaluation sets, and evaluation code are available here.
Dataset Splits | Yes | The examples are generated from a uniform distribution between Yes and No, so the random guessing baseline is 50%. ... We evaluate the models on chain-lengths of 2, 3, and 4 to test if their performance degrades over long horizons. ... The rendered image contains between 4-10 components, chosen from a uniform distribution. ... We evaluate on only 200 or 125 examples per variant for cost-efficiency reasons.
Hardware Specification | Yes | For the open-source models, we evaluated locally on a system equipped with three NVIDIA A6000 GPUs (48GB VRAM each) and dual Intel(R) Xeon(R) Gold 5220R CPUs @ 2.20GHz.
Software Dependencies | Yes | We evaluated CLAUDE 3.7 SONNET, GEMINI 2.5 PRO, and CLAUDE SONNET 4 via OpenRouter at model codes anthropic/claude-3.7-sonnet:thinking, google/gemini-2.5-pro-preview, and anthropic/claude-sonnet-4, respectively. ... The OpenAI models were evaluated via the OpenAI API at model codes o4-mini-2025-04-16, o3-2025-04-16, and gpt-5-2025-08-07.
Experiment Setup | Yes | We evaluate GEMINI 2.5 PRO, CLAUDE SONNET 4, GPT-5, as well as other closed and open-source models on our benchmark in a few-shot setting. ... All error bars represent standard error. The full evaluation details, including prompts, can be found in the Appendix.
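
The quoted excerpts mention a 50% random-guessing baseline, evaluation sets of 200 or 125 examples per variant, and error bars that represent standard error. The excerpts do not say how the standard error was computed. The sketch below assumes the usual binomial formula for a proportion, sqrt(p(1 - p) / n), and uses a hypothetical helper named `standard_error` to show how wide such error bars would be at chance level for the two sample sizes. It is an illustration under that assumption, not the authors' evaluation code.

```python
import math


def standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimated from n independent trials.

    Assumption: each example is an independent correct/incorrect trial.
    The source does not state which formula the authors used.
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


# 50% is the random-guessing baseline quoted for the balanced Yes/No examples.
CHANCE = 0.5

for n in (200, 125):  # per-variant evaluation sizes quoted in the table
    se = standard_error(CHANCE, n)
    print(f"n={n}: standard error at chance = {se:.4f}")
```

Under this assumption, a model scoring at chance has a standard error of roughly 3.5 percentage points with 200 examples and roughly 4.5 points with 125, so differences of a few points between models on a single variant fall within the noise.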