Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
NL-Eye: Abductive NLI For Images
Authors: Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address this, we introduce NL-EYE, a benchmark designed to assess VLMs visual abductive reasoning skills. [...] Our experiments show that VLMs struggle significantly on NL-EYE, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. |
| Researcher Affiliation | Collaboration | Mor Ventura¹, Michael Toker¹, Nitay Calderon¹, Zorik Gekhman¹·², Yonatan Bitton², Roi Reichart¹ (¹Technion, ²Google Research) |
| Pseudocode | No | The paper describes methods and workflows (e.g., data curation workflow in Figure 6) but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | 1Our data and code are available at https://venturamor.github.io/NLEye/. [...] To ensure the reproducibility of our results and promote further research, we will publicly release the NL-EYE benchmark, along with the code. |
| Open Datasets | Yes | To address this, we introduce NL-EYE, a benchmark designed to evaluate visual abductive reasoning capabilities of VLMs across multiple images. [...] 1Our data and code are available at https://venturamor.github.io/NLEye/. |
| Dataset Splits | No | Joining recent efforts in evaluating VLMs with an emphasis on the quality of test sets over their sheer size [...] we carefully curated 350 test set examples. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running the experiments. It mentions using API versions of closed-source models, implying the hardware is on the provider's side. |
| Software Dependencies | No | The paper lists API versions of closed-source models (Table 15) and names of open-source models (LLaVA 1.6, Fuyu, DeBERTa-v3, BART-L) but does not provide specific version numbers for ancillary software dependencies like programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other solvers/frameworks used for the experiments. |
| Experiment Setup | No | The paper details evaluation metrics and input strategies (e.g., Likert scale, consistency accuracy) for the experiments. However, it does not provide specific training-related hyperparameters (e.g., learning rate, batch size, number of epochs) or system-level training settings, as it primarily evaluates pre-trained models rather than training new ones within the scope of the paper's experiments. |
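The notice above states that the automated LLM labels are validated against a manually labeled dataset. As a minimal sketch of what such a validation step could look like (the variable names and labels below are illustrative, not taken from the actual pipeline in [1]):

```python
# Hypothetical sketch: comparing automated LLM labels for reproducibility
# variables against manually annotated gold labels for the same paper.
def accuracy(pred: dict, gold: dict) -> float:
    """Fraction of reproducibility variables where the LLM label
    matches the manual (gold) label."""
    matches = sum(pred[k] == gold[k] for k in gold)
    return matches / len(gold)

# Illustrative labels for a single paper (not real pipeline output).
llm_labels = {"Open Source Code": "Yes", "Dataset Splits": "No", "Pseudocode": "No"}
gold_labels = {"Open Source Code": "Yes", "Dataset Splits": "No", "Pseudocode": "Yes"}

print(accuracy(llm_labels, gold_labels))  # 2 of 3 labels agree
```

In practice such validation would be run per variable across many papers, yielding the per-variable accuracy metrics referenced in [1].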