Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning

Authors: Mikołaj Małkiński, Jacek Mańdziuk

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate 13 strong models from the AVR literature on the introduced datasets, revealing their specific shortcomings in generalization and knowledge transfer.
Researcher Affiliation	Academia	¹Warsaw University of Technology, Warsaw, Poland; ²AGH University of Krakow, Krakow, Poland. EMAIL, EMAIL
Pseudocode	No	The paper describes methods and processes but does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	The code for reproducing all experiments is publicly accessible at: https://github.com/mikomel/raven
Open Datasets	Yes	First, we introduce Attributeless-I-RAVEN (A-I-RAVEN), comprising 10 generalization regimes. Next, we propose I-RAVEN-Mesh, a variant of I-RAVEN with a new grid-like structure overlaid on the matrices. The released code allows for generation of all datasets from scratch, eliminating the dependency on file-hosting services required to distribute the data.
Dataset Splits	Yes	In each experiment, we utilize 42 000 training, 14 000 validation, and 14 000 test matrices, following the standard data split protocol taken in prior works [Zhang et al., 2019a; Hu et al., 2021].
Hardware Specification	Yes	Experiments were run on a worker with a single NVIDIA DGX A100 GPU.
Software Dependencies	No	The paper mentions using the Adam optimizer with specific parameters and that the training job is packaged as a Docker image with fixed dependencies, but it does not explicitly list software dependencies with specific version numbers within the text.
Experiment Setup	Yes	In all experiments we use the Adam optimizer [Kingma and Ba, 2014] with β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸ and a batch size set to 128. Learning rate is initialized to 0.001 and reduced 10-fold (at most 3 times) if no progress is seen in the validation loss in 5 subsequent epochs, and training stops early in the case of 10 epochs without progress.
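
The learning-rate and early-stopping rule quoted in the Experiment Setup row can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the class name `PlateauSchedule`, its field names, and the `ADAM_CONFIG` dictionary are hypothetical. The sketch also assumes that the 5-epoch counter restarts after each learning-rate reduction and that "progress" means a strictly lower validation loss; the quoted text does not specify either detail.

```python
from dataclasses import dataclass

# Optimizer settings quoted in the table (the dictionary itself is illustrative).
ADAM_CONFIG = {"betas": (0.9, 0.999), "eps": 1e-8, "batch_size": 128}


@dataclass
class PlateauSchedule:
    """Hypothetical helper tracking LR reduction and early stopping."""

    lr: float = 1e-3          # initial learning rate
    factor: float = 0.1       # 10-fold reduction
    max_reductions: int = 3   # reduce at most 3 times
    lr_patience: int = 5      # epochs without progress before reducing
    stop_patience: int = 10   # epochs without progress before stopping
    best_loss: float = float("inf")
    epochs_since_best: int = 0
    epochs_since_lr_change: int = 0
    reductions: int = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            # Progress: remember the new best and reset both counters.
            self.best_loss = val_loss
            self.epochs_since_best = 0
            self.epochs_since_lr_change = 0
            return False

        self.epochs_since_best += 1
        self.epochs_since_lr_change += 1

        if self.epochs_since_best >= self.stop_patience:
            return True  # 10 epochs without progress: stop early

        if (self.epochs_since_lr_change >= self.lr_patience
                and self.reductions < self.max_reductions):
            # Assumption: the LR counter restarts after each reduction.
            self.lr *= self.factor
            self.reductions += 1
            self.epochs_since_lr_change = 0
        return False
```

In a training loop, `step` would be called once per epoch with the validation loss, and the current `lr` would be copied into the optimizer. Deep learning frameworks usually ship an equivalent plateau scheduler, which would be the more natural choice in practice.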