Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Choose your Data Wisely: A Framework for Semantic Counterfactuals

Authors: Edmund Dervakos, Konstantinos Thomas, Giorgos Filandrianos, Giorgos Stamou

IJCAI 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	For evaluating the proposed framework, we conducted four experiments, each with a different purpose. The first is a user study for comparing our work with a state-of-the-art image counterfactual system, which was performed on the CUB dataset [Wah et al., 2011].
Researcher Affiliation	Academia	Edmund Dervakos¹, Konstantinos Thomas¹, Giorgos Filandrianos¹ and Giorgos Stamou¹; ¹National Technical University of Athens
Pseudocode	No	The paper describes the algorithm for computing counterfactual explanations in text (Section 4) but does not provide structured pseudocode or algorithm blocks (e.g., labeled 'Algorithm 1').
Open Source Code	Yes	Further details about this experiment are available in the supplementary material². (Footnote 2: https://github.com/geofila/Semantic Counterfactuals/blob/main/Supplementary%20Material.pdf. The base GitHub repository provides the source code for the framework.)
Open Datasets	Yes	The first is a user study for comparing our work with a state-of-the-art image counterfactual system, which was performed on the CUB dataset [Wah et al., 2011]. For our second experiment, we decided to explore more realistic examples and took advantage of the COCO dataset, which contains object-annotated, real-world images that can automatically be linked to an external knowledge. For our cross-checking dataset, we will use Visual Genome since it is, along with COCO, one of the very few datasets containing annotated images. We provide explanations for a classifier that was trained on a subset of the Coswara Dataset, specifically, the winning entry of the IEEE COVID-19 sensor informatics challenge 5. As an explanation dataset, we used data from the Smarty4covid platform 6, which contains similar audio files and includes additional annotations, such as gender, symptoms, medical history, etc., in the form of an ontology.
Dataset Splits	No	The paper mentions using the 'test set of CUB' for predictions but does not explicitly provide details about training, validation, or test splits for any dataset used in their own experimental setup (e.g., percentages, sample counts, or methodology for splitting).
Hardware Specification	No	The paper does not provide specific details about the hardware (e.g., exact GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies	No	The paper mentions software like 'networkx python package' and 'NLTK python package' but does not specify their version numbers, which are required for reproducible software dependencies.
Experiment Setup	No	While the paper describes the setup for its human study and the data used, it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, epochs), optimizer settings, or other system-level training configurations for any models.