Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
NL-Eye: Abductive NLI For Images
Authors: Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce NL-EYE, a benchmark designed to assess VLMs visual abductive reasoning skills. [...] Our experiments show that VLMs struggle significantly on NL-EYE, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. |
| Researcher Affiliation | Collaboration | Mor Ventura¹, Michael Toker¹, Nitay Calderon¹, Zorik Gekhman¹·², Yonatan Bitton², Roi Reichart¹ (¹Technion, ²Google Research) |
| Pseudocode | No | The paper describes methods and workflows (e.g., data curation workflow in Figure 6) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | 1Our data and code are available at https://venturamor.github.io/NLEye/. [...] To ensure the reproducibility of our results and promote further research, we will publicly release the NL-EYE benchmark, along with the code. |
| Open Datasets | Yes | To address this, we introduce NL-EYE, a benchmark designed to evaluate visual abductive reasoning capabilities of VLMs across multiple images. [...] 1Our data and code are available at https://venturamor.github.io/NLEye/. |
| Dataset Splits | No | Joining recent efforts in evaluating VLMs with an emphasis on the quality of test sets over their sheer size [...] we carefully curated 350 test set examples. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. It mentions using API versions of closed-source models, implying the hardware is on the provider's side. |
| Software Dependencies | No | The paper lists API versions of closed-source models (Table 15) and names of open-source models (LLaVA 1.6, Fuyu, DeBERTa-v3, BART-L) but does not provide specific version numbers for ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers/frameworks used for the experiments. |
| Experiment Setup | No | The paper details evaluation metrics and input strategies (e.g., Likert scale, consistency accuracy) for the experiments. However, it does not provide specific training-related hyperparameters (e.g., learning rate, batch size, number of epochs) or system-level training settings, as it primarily evaluates pre-trained models rather than training new ones within the scope of the paper's experiments. |
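The notice above states that the automated LLM labels are validated against a manually labeled dataset. As a minimal sketch of what such a validation step could look like (the variable names and labels below are illustrative, not taken from the actual pipeline in [1]):

```python
# Hypothetical sketch: comparing automated LLM labels for reproducibility
# variables against manually annotated gold labels for the same paper.
def accuracy(pred: dict, gold: dict) -> float:
    """Fraction of reproducibility variables where the LLM label
    matches the manual (gold) label."""
    matches = sum(pred[k] == gold[k] for k in gold)
    return matches / len(gold)

# Illustrative labels for a single paper (not real pipeline output).
llm_labels = {"Open Source Code": "Yes", "Dataset Splits": "No", "Pseudocode": "No"}
gold_labels = {"Open Source Code": "Yes", "Dataset Splits": "No", "Pseudocode": "Yes"}

print(accuracy(llm_labels, gold_labels))  # 2 of 3 labels agree
```

In practice such validation would be run per variable across many papers, yielding the per-variable accuracy metrics referenced in [1].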