Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Authors: Anand Bhattad, Konpat Preechakul, Alexei A Efros

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate our approach through a quantitative pairwise evaluation as well as qualitative, full-scene decompositions. 4 Evaluation Human visual inspection, while qualitative, remains the most natural way to evaluate Visual Jenga. To complement this qualitative assessment, we also perform an automatic quantitative evaluation. Our evaluation comprises three parts: pair-wise object ordering (Sec. 4.1), complete scene decomposition (Sec. 4.2), and comparison to simple heuristics (Sec. 4.3). All evaluation data are provided in Supp.
Researcher Affiliation	Academia	Anand Bhattad1 Konpat Preechakul2 Alexei A. Efros2 1Johns Hopkins University 2University of California, Berkeley
Pseudocode	No	The paper describes its method in sections 3.2 and 3.3, outlining steps and showing a pipeline diagram in Figure 4. However, it does not present a clearly labeled algorithm block or pseudocode.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We provide data, and we will release the code upon publication.
Open Datasets	Yes	Full dataset availability All images in our evaluation datasets are provided as HTML webpages in the project page for comprehensive inspection. This includes: 1. Full Scene Decomposition dataset: 56 scenes collected both from our own photography and from internet searches using keywords such as "messy desk", "messy room", and "stacked objects". 2. Pair-wise object ordering dataset: NYU-v2: We use NYU Depth V2 dataset [66]... COCO: We manually collected 200 random images from COCO dataset [43]... Cluttered Parse: ... we also created Cluttered Parse dataset.
Dataset Splits	Yes	4 Evaluation Human visual inspection, while qualitative, remains the most natural way to evaluate Visual Jenga. To complement this qualitative assessment, we also perform an automatic quantitative evaluation. Our evaluation comprises three parts: pair-wise object ordering (Sec. 4.1), complete scene decomposition (Sec. 4.2), and comparison to simple heuristics (Sec. 4.3). 4.1 Pair-wise object ordering NYU-v2: ...extracted 485 unique images yielding 668 pair-wise comparisons... COCO: We manually collected 200 random images from COCO dataset [43]... Cluttered Parse: ...a test set of 40 challenging object pairs from 40 unique internet images... 4.2 Full Scene Decomposition ...we further collected 56 unique scenes...
Hardware Specification	Yes	Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: Details in supplementary. For every paired object comparison, our method requires about 6 minutes on an A6000 NVIDIA GPU.
Software Dependencies	Yes	To get object masks in the scene, we use off-the-shelf models. We first extract object coordinates using MOLMO [14] (Fig. 4a), and then use these as prompts for SAM 2 [60] to obtain segmentation maps without class labels (Fig. 4b). To compute it, we first gather N different inpaintings of A, denoted c_j^new for j ∈ [1, N], using Runway's checkpoint of Stable Diffusion 1.5 [64]. ...we use an off-the-shelf object remover, Adobe Firefly [2] (Fig. 4d). Appendix D Inpainting details: Since the time of our paper, the original Runway's checkpoint has been deprecated, but there are alternate mirrored third-party versions: https://huggingface.co/stable-diffusion-v1-5/stable-diffusioninpainting.
Experiment Setup	Yes	(ii). Obtaining reliable conditional probability. ... we first gather N different inpaintings of A, denoted c_j^new for j ∈ [1, N], using Runway's checkpoint of Stable Diffusion 1.5 [64]... We then quantify how semantically diverse these N inpaintings are using both CLIP [59] and DINO [53] features. 4.4 Ablation studies Effect of the number of inpainting samples. The larger number of inpaintings (N) helps better capture the distribution of possible scene completions and monotonically increases the performance, but with diminishing returns beyond N = 8 (see Appendix F). We used N = 16 by default.
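
The Experiment Setup excerpt says the N inpaintings are scored for semantic diversity using CLIP and DINO features, but it does not state the exact formula. The sketch below is one plausible reading, not the authors' implementation: it assumes diversity is the mean pairwise cosine distance between feature vectors and that the two feature spaces are averaged with equal weight. The function names and the random placeholder features are hypothetical; real features would come from CLIP and DINO encoders applied to each inpainted crop.

```python
import numpy as np


def mean_pairwise_cosine_distance(features):
    """Average (1 - cosine similarity) over all distinct pairs of rows.

    `features` is an (N, D) array with one feature vector per inpainting.
    """
    feats = np.asarray(features, dtype=np.float64)
    if feats.ndim != 2 or feats.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two feature vectors")
    # L2-normalize each row; the clip guards against division by zero.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Keep only the strict upper triangle so each pair is counted once.
    rows, cols = np.triu_indices(feats.shape[0], k=1)
    return float(np.mean(1.0 - sim[rows, cols]))


def diversity_score(clip_feats, dino_feats):
    """Assumed combination: equal-weight average of the two diversities."""
    return 0.5 * (
        mean_pairwise_cosine_distance(clip_feats)
        + mean_pairwise_cosine_distance(dino_feats)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 16  # default number of inpaintings reported in the excerpt
    # Placeholder features; the dimensions 512 and 768 are illustrative only.
    clip_feats = rng.normal(size=(n, 512))
    dino_feats = rng.normal(size=(n, 768))
    print(diversity_score(clip_feats, dino_feats))
```

Under this assumption, identical inpaintings give a score of 0 and mutually orthogonal feature vectors give a score of 1, so a higher value means the inpainted content varies more across the N samples.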