Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
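The validation step described above amounts to comparing the pipeline's labels against the manual reference set. A minimal sketch of that comparison is shown below; the function name, label values, and data are illustrative assumptions, not taken from the actual pipeline in [1].

```python
# Hypothetical sketch of validating LLM-based reproducibility labels
# against a manually annotated reference set. Labels and field values
# are illustrative only.

def label_accuracy(llm_labels, manual_labels):
    """Fraction of items where the LLM label matches the manual label."""
    if len(llm_labels) != len(manual_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(l == m for l, m in zip(llm_labels, manual_labels))
    return matches / len(manual_labels)

# Toy example: five papers classified for one reproducibility variable.
llm = ["Yes", "No", "Yes", "No", "Yes"]
manual = ["Yes", "No", "No", "No", "Yes"]
print(f"accuracy = {label_accuracy(llm, manual):.2f}")  # accuracy = 0.80
```

In practice such a comparison would be computed per reproducibility variable, and chance-corrected agreement statistics (e.g., Cohen's kappa) are often reported alongside raw accuracy.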

NL-Eye: Abductive NLI For Images

Authors: Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "To address this, we introduce NL-EYE, a benchmark designed to assess VLMs' visual abductive reasoning skills. [...] Our experiments show that VLMs struggle significantly on NL-EYE, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality." |
| Researcher Affiliation | Collaboration | Mor Ventura¹, Michael Toker¹, Nitay Calderon¹, Zorik Gekhman¹·², Yonatan Bitton², Roi Reichart¹ (¹Technion, ²Google Research) |
| Pseudocode | No | The paper describes methods and workflows (e.g., the data curation workflow in Figure 6) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "Our data and code are available at https://venturamor.github.io/NLEye/. [...] To ensure the reproducibility of our results and promote further research, we will publicly release the NL-EYE benchmark, along with the code." |
| Open Datasets | Yes | "To address this, we introduce NL-EYE, a benchmark designed to evaluate visual abductive reasoning capabilities of VLMs across multiple images. [...] Our data and code are available at https://venturamor.github.io/NLEye/." |
| Dataset Splits | No | "Joining recent efforts in evaluating VLMs with an emphasis on the quality of test sets over their sheer size [...] we carefully curated 350 test set examples." |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. It mentions using API versions of closed-source models, implying the hardware is on the provider's side. |
| Software Dependencies | No | The paper lists API versions of closed-source models (Table 15) and names of open-source models (LLaVA 1.6, Fuyu, DeBERTa-v3, BART-L), but does not provide version numbers for ancillary software dependencies such as the programming language (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other frameworks used for the experiments. |
| Experiment Setup | No | The paper details evaluation metrics and input strategies (e.g., Likert-scale ratings, consistency accuracy) for the experiments. However, it does not provide training-related hyperparameters (e.g., learning rate, batch size, number of epochs) or system-level training settings, as it evaluates pre-trained models rather than training new ones. |