Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
Authors: Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing in advancing state-of-the-art in counterfactual image retrieval. The source code is available at https://github.com/nickhaidos/sceniricml2025. 4. Experiments Datasets For our experiments we leverage the PSG scene graph dataset (Yang et al., 2022)... Ground Truth and Evaluation We employ approximate GED as the ground truth distance/similarity for evaluating our approach... Baselines We initially compare our proposed architecture to SotA pre-trained Vision and Vision-Language (VL) models, supervised GNNs, and basic GAEs. 4.1. Quantitative Results In Table 1 we present test set retrieval results... Ablation Studies To understand the contribution of each integrated architectural component, we conduct ablation experiments (Table 2)... |
| Researcher Affiliation | Academia | 1Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens. Correspondence to: Nikolaos Chaidos <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training process using textual descriptions, mathematical formulas, and diagrams (Figure 3), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/nickhaidos/sceniricml2025. |
| Open Datasets | Yes | Datasets For our experiments we leverage the PSG scene graph dataset (Yang et al., 2022), a more curated version of the traditional scene graph dataset, Visual Genome (Krishna et al., 2017), as it is based on more advanced panoptic segmentation masks, containing almost 49K annotated image, caption and scene graph samples. We also experiment on images from Flickr30K (Young et al., 2014) to evaluate SCENIR in a real-world use case, where caption and scene-graph annotations are unavailable, requiring us to generate synthetic ones (details about dataset preprocessing in Appendix A). |
| Dataset Splits | Yes | We select 11K scene graphs for training, and 1K scene graphs for testing. |
| Hardware Specification | Yes | We utilize PyTorch Geometric (Fey & Lenssen, 2019) for supervised and unsupervised GNNs, training them on a single P100 GPU. |
| Software Dependencies | No | We utilize PyTorch Geometric (Fey & Lenssen, 2019) for supervised and unsupervised GNNs, training them on a single P100 GPU. The paper mentions PyTorch Geometric and references open-source libraries but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Regarding the Graph Autoencoders, they were all trained for 30 epochs, with batch size 64, AdamW Optimizer (lr = 0.001, β1 = 0.9, β2 = 0.999, weight decay = 0.01), 1000 latent space dimension, 32 output dimension for the Edge Decoder, 768 output dimension for the Feature Decoder, and 1 output dimension for the Discriminator (real/fake). Concerning the models that employ adversarial training, we followed the training algorithm proposed in Pan et al. (2018), with two separate AdamW optimizers, one for the Discriminator, and one for the rest of the model parameters. Also, we used an Exponential Learning Rate Scheduler (γ = 0.95), and Loss Tradeoff terms in order to stabilize the training. The final chosen parameters were λ1 = 3, λ2 = 1/6 and λ3 = 1/3. |
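The "Experiment Setup" row above can be made concrete with a minimal, dependency-free sketch of the reported training schedule: an exponential learning-rate decay (γ = 0.95 from a base lr of 0.001 over 30 epochs) and a weighted combination of loss terms with the reported tradeoffs λ1 = 3, λ2 = 1/6, λ3 = 1/3. Note that the component-loss names below (edge/feature/adversarial) are illustrative assumptions based on the decoder/discriminator heads the row describes; the paper defines the exact loss terms, and the actual implementation uses PyTorch's AdamW and scheduler classes rather than these toy functions.

```python
def lr_at_epoch(base_lr: float, gamma: float, epoch: int) -> float:
    """Exponential LR schedule: lr_t = base_lr * gamma**t (epoch is 0-indexed)."""
    return base_lr * gamma ** epoch


def combined_loss(l_edge: float, l_feat: float, l_adv: float,
                  lam1: float = 3.0, lam2: float = 1 / 6, lam3: float = 1 / 3) -> float:
    """Weighted sum of loss components using the paper's reported tradeoff terms.

    The mapping of lambda_i to specific components is a hypothetical assumption
    for illustration; consult the paper for the exact loss definition.
    """
    return lam1 * l_edge + lam2 * l_feat + lam3 * l_adv


# Learning rate at the final (30th) epoch of training, per the reported schedule:
print(lr_at_epoch(base_lr=0.001, gamma=0.95, epoch=29))
# Example combined loss for arbitrary component values:
print(combined_loss(l_edge=0.5, l_feat=0.9, l_adv=0.3))
```

This makes the stabilization strategy in the row easy to verify: by the last epoch the learning rate has decayed to roughly a quarter of its initial value, while the λ weights emphasize edge reconstruction over the feature and adversarial terms.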