Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Visual Structures Help Visual Reasoning: Addressing the Binding Problem in LVLMs

Authors: Amirmohammad Izadi, Mohammadali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Vahedi, Hosein Hasani, Mahdieh Baghshah

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically demonstrate substantial performance improvements across core visual reasoning tasks, using only a single-query inference. Specifically, VISER improves GPT-4o performance on visual search, counting, and spatial relationship tasks by 25.0%, 26.8%, and 9.5%, respectively, and reduces edit distance error in scene description by 0.32 on 2D datasets. VISER underscores the importance of visual input design over purely linguistically based reasoning strategies and suggests that visual structuring is a powerful and general approach for enhancing compositional and spatial reasoning in LVLMs.
Researcher Affiliation	Academia	Department of Computer Engineering Sharif University of Technology
Pseudocode	No	The paper describes methods and processes through textual descriptions and visual examples in figures, such as Figure 1 and Figure 2, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format that would qualify as pseudocode.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Codes and information of datasets that are constructed or reused in the paper are anonymized and included in the main paper and supplementary material.
Open Datasets	Yes	Datasets To evaluate VISER, we use both synthetic and natural datasets. A Binding Problem Generator [9] produces synthetic data that can be controlled in two and three dimensions and can incorporate a variable number of objects. Additionally, we benchmark on two real-world tasks: Learning To Count Everything [32] and the Spatial Reasoning [38] datasets.
Dataset Splits	Yes	2D Scenes: We created scenes containing 20, 30, 40, and 50 objects. Each configuration comprised 100 images (50 with targets present and 50 absent), totaling 400 2D scenes. 3D Scenes: Using the same object count progression, we generated 50 images per count (25 with targets present and 25 absent), resulting in 200 3D scenes. ... For each object count, we produced 100 distinct images, resulting in a total of 600 2D images. ... We generated 200 synthetic 2D scenes with controlled object configurations. For each scene, we created multiple-choice questions for each scene, asking models to select the answer that best describes the spatial relation between the target objects.
Hardware Specification	No	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [No] Justification: Experiments were conducted via external model APIs, so hardware, runtime, call counts, and token totals are unavailable, leaving compute requirements unspecified.
Software Dependencies	No	The paper mentions proprietary models like "OpenAI's GPT-4o [39]" and "Anthropic's Claude3.5-sonnet [40]", and open-source models like "Qwen2.5-VL-7B-Instruct [41]" and "LLaMa4-scout-17b-16e-Instruct [42]", but does not list specific software dependencies with version numbers for the implementation of their proposed method (VISER).
Experiment Setup	Yes	The value of n is set to 3 in our experiments. ... All experiments are conducted on a consistent subset of our 2D synthetic dataset [9], with varying object counts per scene. ... To minimize non-determinism, all evaluations use greedy decoding (temperature = 0), ensuring deterministic outputs for each input.
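
The scene totals quoted in the Dataset Splits row can be checked with a little arithmetic. The Python sketch below is only an illustration and is not taken from the paper's code: the function names are hypothetical, and the even present/absent split is assumed from the quoted "50 with targets present and 50 absent" and "25 with targets present and 25 absent". For the 600-image set, the excerpt does not list the object counts, so the sketch only infers how many count settings the total implies.

```python
# Object counts quoted for the 2D and 3D visual search scenes.
OBJECT_COUNTS = (20, 30, 40, 50)


def scene_total(images_per_count, object_counts=OBJECT_COUNTS):
    """Total scenes when every object count gets the same number of images."""
    return images_per_count * len(object_counts)


def present_absent(images_per_count):
    """Split one configuration evenly into target-present and target-absent images."""
    present = images_per_count // 2
    return present, images_per_count - present


def implied_count_settings(total_images, images_per_count):
    """Number of distinct object-count settings implied by a quoted total."""
    if total_images % images_per_count != 0:
        raise ValueError("total is not a multiple of the per-count image number")
    return total_images // images_per_count


if __name__ == "__main__":
    print(scene_total(100))                  # 2D scenes: 400
    print(scene_total(50))                   # 3D scenes: 200
    print(present_absent(100))               # (50, 50)
    print(present_absent(50))                # (25, 25)
    print(implied_count_settings(600, 100))  # 6 settings; their values are not given in the excerpt
```

The quoted totals are internally consistent: 4 object counts at 100 images each give 400 2D scenes, and 4 at 50 each give 200 3D scenes. The 600-image 2D set implies six object-count settings at 100 images each.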