Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Explaining Explanations: Axiomatic Feature Interactions for Deep Networks

Authors: Joseph D. Janizek, Pascal Sturmfels, Su-In Lee

JMLR 2021

Reproducibility variables, classification results, and the supporting LLM responses:
Research Type: Experimental
LLM Response: "Additionally, we find that our method is faster than existing methods when the number of features is large, and outperforms previous methods on existing quantitative benchmarks. ... We empirically evaluated our method against other methods using benchmarks inspired by recent literature on quantitatively evaluating feature attribution methods (Adebayo et al., 2018; Kindermans et al., 2019; Hooker et al., 2019; Yeh et al., 2019; Lin et al., 2019)."
Researcher Affiliation: Academia
LLM Response: "Joseph D. Janizek EMAIL Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA 98195-4322, USA"
Pseudocode: No
LLM Response: The paper provides mathematical derivations and discrete-sum approximations for Integrated Hessians but does not present any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: No
LLM Response: "We downloaded pre-trained weights for DistilBERT (Sanh et al., 2019) from the Hugging Face Transformers library (Wolf et al., 2019)." ... The paper uses third-party libraries and models but does not explicitly state that its own implementation code is released, nor does it provide a link to a repository for the described methodology.
Open Datasets: Yes
LLM Response: "We fine-tuned the model on the Stanford Sentiment Treebank data set (Socher et al., 2013)... We examined the Cleveland heart disease data set (Detrano et al., 1989; Das et al., 2009)... We utilized the HTRU2 data set, curated by Lyon et al. (2016) and originally gathered by Keith et al. (2010)."
Dataset Splits: Yes
LLM Response: "We split the data into 238 patients for training (of which 109 had coronary artery disease) and 60 for testing (of which 28 have coronary artery disease). ... We split the data into 14,318 training examples (1,365 are pulsars) and 3,580 testing examples (274 are pulsars)..."
Hardware Specification: No
LLM Response: "However, back-propagating the Hessian through the model is easily done in parallel on a GPU since this functionality already exists in modern deep learning frameworks (see Appendix B.3) (Paszke et al., 2019; Abadi et al., 2016)." ... The paper mentions using a GPU but does not specify a particular model or any other hardware components.
Software Dependencies: No
LLM Response: "Hugging Face Transformers library (Wolf et al., 2019). ... Adam algorithm... Adam optimizer in PyTorch... The ComBat tool (a robust empirical Bayes regression implemented as part of the sva R package)..." The paper mentions several software tools and libraries (Hugging Face Transformers, PyTorch, the Adam optimizer, the sva R package, TensorFlow) but does not provide version numbers for any of them.
Experiment Setup: Yes
LLM Response: "We fine-tuned for 3 epochs using a batch size of 32 and a learning rate of 0.00003. We used a max sequence length of 128 tokens, and the Adam algorithm for optimization... The convolutional neural network... trained with a batch size of 128 for 2 epochs and used a learning rate of 0.001. We optimized using the Adam algorithm... We used a two-layer neural network with 128 and 64 hidden units, respectively, with SoftPlus activation... We optimized using gradient descent... with an initial learning rate of 0.1 that decays exponentially with a rate of 0.99 after each epoch. We used Nesterov momentum with β = 0.9... After training for 200 epochs... We trained the network to optimize a mean squared error loss function and used the Adam optimizer in PyTorch with default hyperparameters and a learning rate equal to 10⁻⁵... stopped the training when mean squared error on the held-out validation set failed to improve over 10 epochs, and found that the network reached an optimum at 200 epochs."
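To make the Pseudocode finding above concrete: the discrete-sum approximation of Integrated Hessians that the paper describes can be sketched for a toy function with a known interaction. This is an illustrative reconstruction under stated assumptions (zero baseline, a midpoint-rule grid over the two path parameters, and hand-coded derivatives standing in for a network's autograd), not the paper's implementation:

```python
# Toy function with an explicit interaction: f(x1, x2) = x1 * x2.
# Analytic derivatives replace autograd for this sketch.
def grad_f(x1, x2):
    return (x2, x1)          # (df/dx1, df/dx2)

def cross_f(x1, x2):
    return 1.0               # d2f/dx1dx2 (constant for this toy f)

def integrated_hessians(x1, x2, k=100):
    """Midpoint-rule discrete sum over an (alpha, beta) grid:
    gamma[i][j] ~ sum of x_i * x_j * a*b * d2f(a*b*x) / k^2 for i != j,
    with an additional first-derivative term on the diagonal."""
    pts = [(i + 0.5) / k for i in range(k)]
    gamma = [[0.0, 0.0], [0.0, 0.0]]
    w = 1.0 / (k * k)
    for a in pts:
        for b in pts:
            s = a * b
            c = cross_f(s * x1, s * x2)
            g1, g2 = grad_f(s * x1, s * x2)
            gamma[0][1] += x1 * x2 * s * c * w
            gamma[1][0] += x1 * x2 * s * c * w
            gamma[0][0] += x1 * g1 * w   # d2f/dx1^2 term is zero here
            gamma[1][1] += x2 * g2 * w
    return gamma

gamma = integrated_hessians(3.0, 2.0)
total = sum(sum(row) for row in gamma)
# Interaction completeness: entries should sum to f(x) - f(0) = 6.
```

The final check mirrors the method's completeness property, under which interaction values sum to the difference between the model output at the input and at the baseline.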
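The Experiment Setup excerpt spans four separate models. Gathered into one place as a hedged configuration sketch (the dictionary keys and grouping are illustrative, not from the paper; the 10⁻⁵ learning rate is read from the garbled "10 5" in the source):

```python
# Hedged sketch: hyperparameters transcribed from the quoted setup.
# Key names are illustrative; the paper does not define this structure.
experiment_configs = {
    "distilbert_sst": {          # DistilBERT fine-tuned on SST
        "epochs": 3, "batch_size": 32, "lr": 3e-5,
        "max_seq_len": 128, "optimizer": "Adam",
    },
    "cnn": {                     # convolutional network
        "epochs": 2, "batch_size": 128, "lr": 1e-3, "optimizer": "Adam",
    },
    "two_layer_mlp": {           # heart-disease model
        "hidden_units": (128, 64), "activation": "SoftPlus",
        "optimizer": "SGD", "lr": 0.1, "lr_decay": 0.99,
        "nesterov_momentum": 0.9, "epochs": 200,
    },
    "regression_net": {          # MSE-trained network
        "loss": "MSE", "optimizer": "Adam", "lr": 1e-5,
        "early_stopping_patience": 10, "epochs": 200,
    },
}
```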