Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning
Authors: Mikołaj Małkiński, Jacek Mańdziuk
IJCAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 13 strong models from the AVR literature on the introduced datasets, revealing their specific shortcomings in generalization and knowledge transfer. |
| Researcher Affiliation | Academia | ¹Warsaw University of Technology, Warsaw, Poland; ²AGH University of Krakow, Krakow, Poland |
| Pseudocode | No | The paper describes methods and processes but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The code for reproducing all experiments is publicly accessible at: https://github.com/mikomel/raven |
| Open Datasets | Yes | First, we introduce Attributeless-I-RAVEN (A-I-RAVEN), comprising 10 generalization regimes. Next, we propose I-RAVEN-Mesh, a variant of I-RAVEN with a new grid-like structure overlaid on the matrices. The released code allows for generation of all datasets from scratch, eliminating the dependency on file-hosting services required to distribute the data. |
| Dataset Splits | Yes | In each experiment, we utilize 42 000 training, 14 000 validation, and 14 000 test matrices, following the standard data split protocol taken in prior works [Zhang et al., 2019a; Hu et al., 2021]. |
| Hardware Specification | Yes | Experiments were run on a worker with a single NVIDIA DGX A100 GPU. |
| Software Dependencies | No | The paper mentions using the Adam optimizer with specific parameters and that the training job is packaged as a Docker image with fixed dependencies, but it does not explicitly list software dependencies with specific version numbers within the text. |
| Experiment Setup | Yes | In all experiments we use the Adam optimizer [Kingma and Ba, 2014] with β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸ and a batch size set to 128. Learning rate is initialized to 0.001 and reduced 10-fold (at most 3 times) if no progress is seen in the validation loss in 5 subsequent epochs, and training stops early in the case of 10 epochs without progress. |
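
The learning-rate and early-stopping rule quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the class `PlateauSchedule`, the helper `simulate`, and all field names are hypothetical. The numeric values (initial rate 0.001, 10-fold reduction, at most 3 reductions, 5 and 10 epochs of patience) come from the quoted text. How the two counters interact is an assumption: here the reduction counter restarts after each cut, while the early-stopping counter restarts only when the validation loss improves.

```python
from dataclasses import dataclass


@dataclass
class PlateauSchedule:
    """Tracks validation loss and decides when to cut the learning rate or stop."""

    lr: float = 1e-3           # initial learning rate (from the quoted setup)
    factor: float = 0.1        # 10-fold reduction (from the quoted setup)
    max_reductions: int = 3    # "at most 3 times" (from the quoted setup)
    lr_patience: int = 5       # epochs without progress before a reduction
    stop_patience: int = 10    # epochs without progress before early stopping
    best_loss: float = float("inf")
    epochs_since_best: int = 0
    epochs_since_cut: int = 0  # assumption: restarts after every reduction
    reductions: int = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            # Progress: remember the new best loss and restart both counters.
            self.best_loss = val_loss
            self.epochs_since_best = 0
            self.epochs_since_cut = 0
            return False

        self.epochs_since_best += 1
        self.epochs_since_cut += 1

        # Early stopping takes priority over a further reduction.
        if self.epochs_since_best >= self.stop_patience:
            return True

        if (self.epochs_since_cut >= self.lr_patience
                and self.reductions < self.max_reductions):
            self.lr *= self.factor
            self.reductions += 1
            self.epochs_since_cut = 0
        return False


def simulate(losses):
    """Feed a sequence of validation losses; return (epochs_run, final_lr, reductions)."""
    schedule = PlateauSchedule()
    for epoch, loss in enumerate(losses, start=1):
        if schedule.step(loss):
            return epoch, schedule.lr, schedule.reductions
    return len(losses), schedule.lr, schedule.reductions
```

Under these assumptions, a validation loss that never improves after the first epoch triggers one reduction at epoch 6 and stops training at epoch 11. In a real training loop the same rule would typically be delegated to the framework's own plateau scheduler and an early-stopping callback, with `step` called once per epoch after validation.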