Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

GenVP: Generating Visual Puzzles with Contrastive Hierarchical VAEs

Authors: Kalliopi Basioti, Pritish Sahu, Qingze Liu, Zihao Xu, Hao Wang, Vladimir Pavlovic

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on five different datasets indicate that GenVP achieves state-of-the-art (SOTA) performance both in puzzle-solving accuracy and out-of-distribution (OOD) generalization in 22 OOD scenarios.
Researcher Affiliation	Collaboration	Kalliopi Basioti¹, {Pritish Sahu², Qingze Tony Liu¹}, Zihao Xu¹, Hao Wang¹, Vladimir Pavlovic¹; ¹Rutgers University, ²SRI International
Pseudocode	No	The paper describes the generative and inference models with equations and textual descriptions, and presents a graphical model in Figure 1, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps.
Open Source Code	No	The paper does not contain an explicit statement about releasing source code for the methodology, nor does it provide a link to a code repository.
Open Datasets	Yes	We assessed GenVP with the RAVEN-based (RAVEN (Zhang et al., 2019a), I-RAVEN (Hu et al., 2021) and RAVEN-FAIR (Benny et al., 2021)) and the VAD (Hill et al., 2019) and PGM (Barrett et al., 2018).
Dataset Splits	Yes	Each training set consists of 1.2 million puzzles, and each testing set consists of 200,000 puzzles. The training set consists of 600K examples, and the testing set consists of 100K.
Hardware Specification	Yes	All the models are trained on a server with 24GB NVIDIA RTX A5000 GPUs, 512GB RAM, and Ubuntu 20.04. For the efficiency and scalability evaluations, we used a server with characteristics of 48GB NVIDIA RTX A6000 GPUs and Dual AMD EPYC 7352 @ 2.3GHz = 48 cores, 96 vCores CPU.
Software Dependencies	No	The paper mentions 'AdamW algorithm (Loshchilov & Hutter, 2017)' and 'PyTorch' but does not specify their version numbers.
Experiment Setup	Yes	In both cases, we used the AdamW algorithm (Loshchilov & Hutter, 2017) with a learning rate 10⁻⁴. We set the batch size to B = {RAVEN-based: 100, PGM: 400, VAD: 400} RPM puzzles, which means that we use B valid puzzles for ELBO and global contrasting and a batch size of A = {RAVEN-based: 7, PGM: 7, VAD: 3} for the local contrasting loss. For the β hyperparameters we set them to β1 = 1, β2 = 0, β3 = 1, β4 = 1, β5 = 1, β6 = 1, βR = 250, βG = 20, βL = 20.
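
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help anyone attempting a reimplementation, since the paper releases no code. The sketch below is only an illustration: every name in it (`TrainConfig`, `make_config`, the dictionary keys) is hypothetical and not taken from the paper, the learning rate is read as 10^-4, and only the numeric values come from the quoted text.

```python
from dataclasses import dataclass

# Hypothetical names throughout; only the numeric values are taken from
# the quoted Experiment Setup text.
LEARNING_RATE = 1e-4  # AdamW learning rate, read as 10^-4 (assumption)

# B: number of valid puzzles per batch for the ELBO and global contrasting.
# A: batch size for the local contrasting loss.
BATCH_SIZES = {
    "RAVEN-based": {"B": 100, "A": 7},
    "PGM": {"B": 400, "A": 7},
    "VAD": {"B": 400, "A": 3},
}

# Loss weights beta_1..beta_6, beta_R, beta_G, beta_L as quoted.
BETAS = {
    "beta_1": 1, "beta_2": 0, "beta_3": 1,
    "beta_4": 1, "beta_5": 1, "beta_6": 1,
    "beta_R": 250, "beta_G": 20, "beta_L": 20,
}


@dataclass(frozen=True)
class TrainConfig:
    """Training settings for one dataset family (illustrative only)."""
    dataset: str
    learning_rate: float
    batch_size_global: int  # B in the quoted text
    batch_size_local: int   # A in the quoted text
    betas: dict


def make_config(dataset: str) -> TrainConfig:
    """Build the config for a dataset family named in the quoted text."""
    if dataset not in BATCH_SIZES:
        raise KeyError(f"unknown dataset family: {dataset!r}")
    sizes = BATCH_SIZES[dataset]
    return TrainConfig(
        dataset=dataset,
        learning_rate=LEARNING_RATE,
        batch_size_global=sizes["B"],
        batch_size_local=sizes["A"],
        betas=dict(BETAS),  # copy so callers cannot mutate the shared table
    )


if __name__ == "__main__":
    for name in BATCH_SIZES:
        print(make_config(name))
```

With PyTorch installed, the learning rate would typically be passed as `torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)`. Because the paper does not specify software versions, any such setup remains an approximation of the authors' environment.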