Framework for Evaluating Faithfulness of Local Explanations

Authors: Sanjoy Dasgupta, Nave Frost, Michal Moshkovitz

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we experimentally validate the new properties and estimators. ... Summary of contributions: Empirical evaluation of these measures and estimators. ... In this section, we consider a scenario where we are given a black-box explanation system (f, e) and wish to evaluate its faithfulness. To this end, we develop statistical estimators for consistency and sufficiency given samples x_1, ..., x_n from an underlying test distribution µ. ... 5. Experiments. We begin with experiments that illustrate basic properties of our faithfulness estimators.
Researcher Affiliation | Academia | ¹University of California San Diego. ²Tel-Aviv University. Correspondence to: Sanjoy Dasgupta <dasgupta@eng.ucsd.edu>, Nave Frost <navefrost@mail.tau.ac.il>, Michal Moshkovitz <moshkovitz5@mail.tau.ac.il>.
Pseudocode | Yes | C. Local estimators. In this section we explore estimators for the local measures. Namely, Algorithm 1 estimates the local consistency and sufficiency measures of explainer e for model f at instance x. ... Algorithm 1: Estimating local consistency and sufficiency. (A sketch of such an estimator appears after this table.)
Open Source Code | No | The paper does not provide any specific links or statements indicating the availability of open-source code for the methodology described.
Open Datasets | Yes | Highlighted text. To evaluate a variety of highlighted-text explainers, we began by training a predictor on the rt-polaritydata dataset, used for sentiment classification of movie reviews, with 10,433 documents. ... Decision trees. ... on the Adult dataset (Kohavi et al., 1996) ... The analysis is conducted on six standard datasets (described in Appendix D.1). ... D.1. Datasets. Datasets in the empirical evaluation are depicted in Table 3: Heart (Janosi et al., 1989), Chess (Dua & Graff, 2017), Avila (De Stefano et al., 2018), Bank marketing (Moro et al., 2014), Adult (Kohavi et al., 1996), Covtype (Blackard & Dean, 1999), rt-polaritydata (Pang & Lee, 2005).
Dataset Splits | Yes | We used 80% of the data to train a linear model. ... using 66.6% of the examples for training. From the remaining 33.3% of the examples we varied the number of sampled records used to estimate consistency/sufficiency (the two estimates are identical in this setting). ... D.2. Model training. In Sections 5.2 and 5.3 we have explained gradient boosted trees models trained over 6 datasets. For each dataset, 66% of it was used for model training and cross-validation. Hyper-parameters were selected based on best mean accuracy over 3 cross-validation executions. (A grid-search sketch reconstructing this split appears after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU or GPU models, or memory specifications.
Software Dependencies | No | The paper does not specify the versions of any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, TensorFlow, scikit-learn), which would be needed to reproduce the software environment.
Experiment Setup | Yes | D.2. Model training. In Sections 5.2 and 5.3 we have explained gradient boosted trees models trained over 6 datasets. For each dataset, 66% of it was used for model training and cross-validation. Hyper-parameters were selected based on best mean accuracy over 3 cross-validation executions. The considered hyper-parameters are all combinations of the following: learning rate: 2^-5, 2^-4, ..., 2^2; n_estimators: 50, 100, 150, 200, 250, 300; max_depth: 3, 4, 5, 6, 7. The selected hyper-parameters and test accuracies are presented in Table 4. (See the grid-search sketch below.)
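
The paper's Algorithm 1 is not reproduced in this report, so the following is only a minimal sketch of plug-in estimators of the kind the quoted passages describe, assuming local consistency at x is estimated as the fraction of sampled points with the same explanation as x that also receive the same prediction, and local sufficiency as the analogous fraction over sampled points covered by x's explanation. The function names and the `satisfies` helper are illustrative assumptions, not interfaces defined in the paper.

```python
import numpy as np

def estimate_local_consistency(f, e, x, samples):
    """Empirical local consistency of explainer e for model f at x:
    among sampled points whose explanation equals e(x), the fraction
    whose prediction equals f(x). Returns None if no sample matches."""
    fx, ex = f(x), e(x)
    same_expl = [z for z in samples if e(z) == ex]
    if not same_expl:
        return None  # conditional event unobserved in the sample
    return float(np.mean([f(z) == fx for z in same_expl]))

def estimate_local_sufficiency(f, e, x, samples, satisfies):
    """Empirical local sufficiency at x: among sampled points covered by
    the explanation e(x) (e.g., satisfying an anchor-style rule), the
    fraction predicted the same as x. `satisfies(expl, z)` is an assumed
    predicate supplied by the caller."""
    fx, ex = f(x), e(x)
    covered = [z for z in samples if satisfies(ex, z)]
    if not covered:
        return None
    return float(np.mean([f(z) == fx for z in covered]))
```

Under this reading, the remark quoted in the Dataset Splits row that "the two estimates are identical in this setting" would correspond to explainers for which having the same explanation and satisfying the explanation coincide.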
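
The report quotes the exact hyper-parameter grid and the 66%/33% split but names no library or versions. Below is a plausible reconstruction using scikit-learn, assuming "gradient boosted trees" maps to GradientBoostingClassifier, the "3 cross-validation executions" to 3-fold CV, and make_classification standing in for one of the six tabular datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; the paper uses six tabular datasets (Heart, Chess, Avila, ...).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 66% of each dataset for model training and cross-validation (Appendix D.2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=0)

# All combinations of the grid quoted in the paper.
param_grid = {
    "learning_rate": [2.0 ** k for k in range(-5, 3)],  # 2^-5, ..., 2^2
    "n_estimators": [50, 100, 150, 200, 250, 300],
    "max_depth": [3, 4, 5, 6, 7],
}

# Select by best mean accuracy over 3 CV folds, then report test accuracy.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

Note that the full grid is 8 × 6 × 5 = 240 combinations, i.e., 720 fits at 3 folds, so shrinking the grid is advisable when merely smoke-testing the setup.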