Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval
Authors: Nikolaos Chaidos, Angeliki Dimitriou, Maria Lymperaiou, Giorgos Stamou
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model demonstrates superior performance across metrics and runtime efficiency, outperforming existing vision-based, multimodal, and supervised GNN approaches. We further advocate for Graph Edit Distance (GED) as a deterministic and robust ground truth measure for scene graph similarity, replacing the inconsistent caption-based alternatives for the first time in image-to-image retrieval evaluation. Finally, we validate the generalizability of our method by applying it to unannotated datasets via automated scene graph generation, while substantially contributing in advancing state-of-the-art in counterfactual image retrieval. The source code is available at https://github.com/nickhaidos/sceniricml2025. 4. Experiments Datasets For our experiments we leverage the PSG scene graph dataset (Yang et al., 2022)... Ground Truth and Evaluation We employ approximate GED as the ground truth distance/similarity for evaluating our approach... Baselines We initially compare our proposed architecture to SotA pre-trained Vision and Vision-Language (VL) models, supervised GNNs, and basic GAEs. 4.1. Quantitative Results In Table 1 we present test set retrieval results... Ablation Studies To understand the contribution of each integrated architectural component, we conduct ablation experiments (Table 2)... |
| Researcher Affiliation | Academia | 1Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens. Correspondence to: Nikolaos Chaidos <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture and training process using textual descriptions, mathematical formulas, and diagrams (Figure 3), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/nickhaidos/sceniricml2025. |
| Open Datasets | Yes | Datasets For our experiments we leverage the PSG scene graph dataset (Yang et al., 2022), a more curated version of the traditional scene graph dataset, Visual Genome (Krishna et al., 2017), as it is based on more advanced panoptic segmentation masks, containing almost 49K annotated image, caption and scene graph samples. We also experiment on images from Flickr30K (Young et al., 2014) to evaluate SCENIR in a real-world use case, where caption and scene-graph annotations are unavailable, requiring us to generate synthetic ones (details about dataset preprocessing in Appendix A). |
| Dataset Splits | Yes | We select 11K scene graphs for training, and 1K scene graphs for testing. |
| Hardware Specification | Yes | We utilize PyTorch Geometric (Fey & Lenssen, 2019) for supervised and unsupervised GNNs, training them on a single P100 GPU. |
| Software Dependencies | No | We utilize PyTorch Geometric (Fey & Lenssen, 2019) for supervised and unsupervised GNNs, training them on a single P100 GPU. The paper mentions PyTorch Geometric and references open-source libraries but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Regarding the Graph Autoencoders, they were all trained for 30 epochs, with batch size 64, AdamW Optimizer (lr = 0.001, β1 = 0.9, β2 = 0.999, weight decay = 0.01), 1000 latent space dimension, 32 output dimension for the Edge Decoder, 768 output dimension for the Feature Decoder, and 1 output dimension for the Discriminator (real/fake). Concerning the models that employ adversarial training, we followed the training algorithm proposed in Pan et al. (2018), with two separate AdamW optimizers, one for the Discriminator, and one for the rest of the model parameters. Also, we used an Exponential Learning Rate Scheduler (γ = 0.95), and Loss Tradeoff terms in order to stabilize the training. The final chosen parameters were λ1 = 3, λ2 = 1/6 and λ3 = 1/3. |
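The "Experiment Setup" row above can be made concrete with a minimal, dependency-free sketch of the reported training schedule: an exponential learning-rate decay (γ = 0.95 from a base lr of 0.001 over 30 epochs) and a weighted combination of loss terms with the reported tradeoffs λ1 = 3, λ2 = 1/6, λ3 = 1/3. Note that the component-loss names below (edge/feature/adversarial) are illustrative assumptions based on the decoder/discriminator heads the row describes; the paper defines the exact loss terms, and the actual implementation uses PyTorch's AdamW and scheduler classes rather than these toy functions.

```python
def lr_at_epoch(base_lr: float, gamma: float, epoch: int) -> float:
    """Exponential LR schedule: lr_t = base_lr * gamma**t (epoch is 0-indexed)."""
    return base_lr * gamma ** epoch


def combined_loss(l_edge: float, l_feat: float, l_adv: float,
                  lam1: float = 3.0, lam2: float = 1 / 6, lam3: float = 1 / 3) -> float:
    """Weighted sum of loss components using the paper's reported tradeoff terms.

    The mapping of lambda_i to specific components is a hypothetical assumption
    for illustration; consult the paper for the exact loss definition.
    """
    return lam1 * l_edge + lam2 * l_feat + lam3 * l_adv


# Learning rate at the final (30th) epoch of training, per the reported schedule:
print(lr_at_epoch(base_lr=0.001, gamma=0.95, epoch=29))
# Example combined loss for arbitrary component values:
print(combined_loss(l_edge=0.5, l_feat=0.9, l_adv=0.3))
```

This makes the stabilization strategy in the row easy to verify: by the last epoch the learning rate has decayed to roughly a quarter of its initial value, while the λ weights emphasize edge reconstruction over the feature and adversarial terms.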