Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
When Stability meets Sufficiency: Informative Explanations that do not Overwhelm
Authors: Ronny Luss, Amit Dhurandhar
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate these claims, both qualitatively and quantitatively, with experiments that show the benefit of PSEM across three modalities (image, tabular and text) as well as versus other path explanations. A user study depicts the strength of the method in communicating the local behavior, where (many) users are able to correctly determine the prediction made by a model. |
| Researcher Affiliation | Industry | Ronny Luss EMAIL IBM Research, Yorktown Heights Amit Dhurandhar EMAIL IBM Research, Yorktown Heights |
| Pseudocode | Yes | Algorithm 1 Path-Sufficient Explanations Method (PSEM) |
| Open Source Code | No | The text states: "The PSEM implementation adapts CEM-PP code from https://github.com/IBM/AIX360." This indicates they adapted existing open-source code but does not explicitly state that their specific PSEM implementation or its modifications are made publicly available or provide a direct link to their own code repository. |
| Open Datasets | Yes | The HELOC dataset FICO (2018) contains credit applicant data... The Celeb A (Liu et al., 2015) dataset contains images... The MNIST dataset is comprised of handwritten digit images... The 20 Newsgroups dataset contains text documents... |
| Dataset Splits | No | The paper mentions various datasets (HELOC, Celeb A, MNIST, 20 Newsgroups) and notes test accuracy for some models, but it does not provide specific train/test/validation split percentages, sample counts, or references to predefined splits needed to reproduce the data partitioning for any of the datasets. |
| Hardware Specification | No | The paper states that "All experiments used 1 GPU and up to 16 GB RAM" but does not specify the GPU model, CPU, or other hardware details needed for full reproducibility. |
| Software Dependencies | No | The paper mentions adapting CEM-PP code and implementing IR, but it does not specify version numbers for any key software components or libraries (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 5: Parameters used for various experiments. MNIST: β ∈ {0.0001, 0.001, 0.01, 0.1, 1.0}, η = 10.0, N = 5, κ = 0.75; HELOC: β ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}, η = 30.0, N = 5, κ = 0.2; Celeb A: β ∈ {0.001, 0.005, 0.01, 0.05}, η = 0.01, N = 4, κ = 0.02; 20 Newsgroups: β ∈ {0.0001, 0.0005, 0.001, 0.005, 0.1}, η = 50.0, N = 5, κ = 0.5 |
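For anyone attempting a reproduction, the Table 5 hyperparameters can be transcribed into a small configuration mapping. This is a convenience sketch only; the key names (`beta`, `eta`, `N`, `kappa`) are our own transliterations of the paper's symbols, not identifiers from the authors' code:

```python
# Hedged transcription of Table 5 from the paper (PSEM experiments).
# Key names are our own; values are copied from the reported table.
PSEM_PARAMS = {
    "MNIST": {
        "beta": [0.0001, 0.001, 0.01, 0.1, 1.0],
        "eta": 10.0, "N": 5, "kappa": 0.75,
    },
    "HELOC": {
        "beta": [0.00001, 0.0001, 0.001, 0.01, 0.1],
        "eta": 30.0, "N": 5, "kappa": 0.2,
    },
    "CelebA": {
        "beta": [0.001, 0.005, 0.01, 0.05],
        "eta": 0.01, "N": 4, "kappa": 0.02,
    },
    "20Newsgroups": {
        "beta": [0.0001, 0.0005, 0.001, 0.005, 0.1],
        "eta": 50.0, "N": 5, "kappa": 0.5,
    },
}
```

Note that β is reported as a sweep (a set of candidate values) while η, N, and κ are fixed per dataset, which is why `beta` is a list and the others are scalars.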