Designing Decision Support Systems using Counterfactual Prediction Sets

Authors: Eleni Straitouri, Manuel Gomez Rodriguez

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a large-scale human subject study (n = 2,751) to compare our methodology to several competitive baselines."
Researcher Affiliation | Academia | "Max Planck Institute for Software Systems, Kaiserslautern, Germany."
Pseudocode | Yes | "Algorithm 1: Counterfactual Successive Elimination" (a generic successive-elimination sketch follows the table)
Open Source Code | Yes | "An open-source implementation of both the strict and the lenient variants of our system, as well as all the data gathered in our human subject study, which we refer to as ImageNet16H-PS, are available at https://github.com/Networks-Learning/counterfactual-prediction-sets."
Open Datasets | Yes | "To construct our dataset ImageNet16H-PS, we gathered 194,407 label predictions from 2,751 human participants for 1,200 unique images from the ImageNet16H dataset (Steyvers et al., 2022) using Prolific. Our experimental protocol received approval from the Institutional Review Board (IRB) at the University of Saarland. Each participant was rewarded with 9 per hour, pro-rated, following Prolific's payment principles, and consented to participate by filling out a consent form that included a detailed description of the study procedures. The collected data did not include any personally identifiable information."
Dataset Splits | Yes | "We always used the same classifier, namely the pre-trained VGG-19 (Simonyan & Zisserman, 2015) after 10 epochs of fine-tuning, as provided by Steyvers et al. (2022), and a fixed calibration set of 120 images, picked at random."
Hardware Specification | Yes | "All experiments ran on a macOS machine with an M1 processor and 16GB memory."
Software Dependencies | Yes | "We implemented our algorithms in Python 3.10.9 using the following libraries: NumPy 1.24.1 (BSD-3-Clause License), Pandas 1.5.3 (BSD-3-Clause License), Scikit-learn 1.2.2 (BSD License)." (a pinned requirements sketch follows the table)
Experiment Setup | Yes | "For reproducibility, we used a fixed random seed in all random procedures, a different one for each realization of the algorithms. Similarly, we used a fixed random seed to randomly pick the 120 images of the calibration set. In our study, we always used the same classifier, namely the pre-trained VGG-19 (Simonyan & Zisserman, 2015) after 10 epochs of fine-tuning, as provided by Steyvers et al. (2022), and a fixed calibration set of 120 images, picked at random." (a seeded-split sketch follows the table)
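
The Pseudocode row names Algorithm 1, Counterfactual Successive Elimination. The paper's counterfactual reward estimator is not reproduced here; as a point of reference only, below is a minimal sketch of the generic successive-elimination bandit template that such algorithms build on. The names (pull_arm, n_arms, horizon) are hypothetical, and the confidence radius is a standard Hoeffding-style bound, not the paper's.

    import numpy as np

    def successive_elimination(pull_arm, n_arms, horizon, delta=0.05):
        """Generic successive elimination: sample every active arm once per
        round, then drop any arm whose upper confidence bound falls below
        the best lower confidence bound. pull_arm(a) must return a reward
        in [0, 1]."""
        active = list(range(n_arms))
        sums = np.zeros(n_arms)
        counts = np.zeros(n_arms)
        t = 0
        while t < horizon and len(active) > 1:
            for a in active:
                sums[a] += pull_arm(a)
                counts[a] += 1
                t += 1
            means = sums[active] / counts[active]
            # Hoeffding-style radius; the union bound over arms and rounds is folded into delta.
            radius = np.sqrt(np.log(2 * n_arms * horizon / delta) / (2 * counts[active]))
            best_lcb = np.max(means - radius)
            active = [a for a, m, r in zip(active, means, radius) if m + r >= best_lcb]
        return active

    # Example with hypothetical Bernoulli arms; the surviving arm(s) should include index 2.
    rng = np.random.default_rng(0)
    probs = [0.4, 0.55, 0.7]
    print(successive_elimination(lambda a: float(rng.random() < probs[a]), n_arms=3, horizon=30000))

In the paper, the arms correspond to candidate prediction-set predictors and the rewards are estimated counterfactually from observed human predictions; the sketch above only illustrates the elimination loop itself.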
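
The Software Dependencies row pins exact library versions, which translate directly into a requirements file. A sketch, assuming the conventional file name requirements.txt (pip does not pin the Python 3.10.9 interpreter itself):

    numpy==1.24.1
    pandas==1.5.3
    scikit-learn==1.2.2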
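
The Dataset Splits and Experiment Setup rows state that a fixed random seed selects 120 of the 1,200 images as the calibration set. A minimal sketch of such a seeded split; the seed value and variable names are placeholders, not taken from the released code:

    import numpy as np

    N_IMAGES, N_CAL = 1200, 120
    rng = np.random.default_rng(seed=42)  # placeholder seed; the released code fixes its own
    perm = rng.permutation(N_IMAGES)
    cal_idx, test_idx = perm[:N_CAL], perm[N_CAL:]  # 120 calibration images, 1,080 held out for evaluation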