Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

V-CECE: Visual Counterfactual Explanations via Conceptual Edits

Authors: Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing both Convolutional Neural Network (CNN), Vision Transformer (ViT) and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation. Project page and code are available at https://nickspanos55.github.io/vcece
Researcher Affiliation	Academia	Nikolaos Spanos Maria Lymperaiou Giorgos Filandrianos Konstantinos Thomas Athanasios Voulodimos Giorgos Stamou National Technical University of Athens EMAIL EMAIL, EMAIL
Pseudocode	No	The paper describes algorithms such as bipartite matching and the Hungarian algorithm, but these are described in narrative text and not presented in a structured pseudocode or algorithm block.
Open Source Code	Yes	Project page and code are available at https://nickspanos55.github.io/vcece
Open Datasets	Yes	Datasets: We experiment with distinct datasets for which the semantics play a definitive role. First, we utilize BDD100K [43] that focuses on real-world autonomous driving situations... Moreover, following the state-of-the-art work on semantic counterfactuals [9], we replicate the Visual Genome experiments on the VG-Random subset, upon which we generate the final images.
Dataset Splits	No	The paper mentions using BDD100K [43] and a VG-Random subset of Visual Genome, and states "To ensure a fair comparison of our results with other methods for the BDD100K dataset, we employ the same DenseNet-121 classifier as in [23]." However, it does not explicitly provide the training/test/validation dataset splits within the text of this paper.
Hardware Specification	Yes	All experiments are conducted on an L40S GPU (48GB) with an average memory usage of 70% (33.6GB). Technical details are provided in App. C.
Software Dependencies	No	The paper mentions several models and samplers like 'Stable Diffusion v1.5 Inpainting model', 'Grounding DINO [28]', 'SAM (Segment Anything Model) [24]', and 'DPM++ 2M SDE sampler [34]'. However, it does not provide specific version numbers for software libraries or frameworks used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Each image is processed for 40 steps with the DPM++ 2M SDE sampler [34] and an automatically selected scheduler. We opt for the default random seed, which can be fixed for reproducibility, while we abstain from applying a variation seed to keep outputs consistent. A high-resolution fix is enabled, adding an extra upscaling pass that improves final image quality. In our configuration, object detection operates with a confidence threshold of 0.3, guiding the inclusion or exclusion of specific object classes via textual prompts. The bounding boxes around detected objects are expanded by 35 pixels, with a soft boundary applied using a mask blur of 10 pixels. For inpainting, the process adheres strictly to the provided guidance, with a classifier-free guidance scale of 10, instructing the model to strongly follow the given prompts. A denoising strength of 1 is used, ensuring the inpainted areas undergo full transformation based on the prompt.
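
The numeric settings quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is only an illustration and not the authors' code: the class name, field names, and both helper functions are hypothetical, and the choices to pad each side of the box, clip it to the image bounds, and treat the confidence threshold as inclusive are assumptions that the quoted text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InpaintingSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    sampling_steps: int = 40
    sampler: str = "DPM++ 2M SDE"
    detection_confidence_threshold: float = 0.3
    box_expansion_px: int = 35
    mask_blur_px: int = 10
    cfg_scale: float = 10.0
    denoising_strength: float = 1.0


def expand_box(box, image_size, pad):
    """Grow an (x0, y0, x1, y1) box by `pad` pixels on every side.

    Clipping to the image bounds is an assumption, not stated in the source.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        max(0, x0 - pad),
        max(0, y0 - pad),
        min(width, x1 + pad),
        min(height, y1 + pad),
    )


def keep_detections(detections, threshold):
    """Keep (label, score, box) tuples whose score meets the threshold.

    Treating the threshold as inclusive is an assumption.
    """
    return [d for d in detections if d[1] >= threshold]


setup = InpaintingSetup()
detections = [("car", 0.5, (50, 50, 100, 100)), ("sign", 0.2, (0, 0, 10, 10))]
kept = keep_detections(detections, setup.detection_confidence_threshold)
masks = [expand_box(box, (640, 480), setup.box_expansion_px) for _, _, box in kept]
```

Keeping the values in one frozen object makes it easy to log the exact setup alongside each generated image, which matters here because the paper leaves the random seed at its default.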