Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Correcting misinterpretations of additive models

Authors: Benedict Clark, Rick Wilming, Hjalmar Schulz, Rustam Zhumagambetov, Danny Panknin, Stefan Haufe

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on the XAI-TRIS benchmark with a novel false-negative invariant formulation of the earth mover's distance accuracy metric demonstrate significant improvements over popular feature attribution methods and the traditional interpretation of additive models. Finally, real-world case studies on the COMPAS and MIMIC-IV datasets provide new insights into the role of specific features by disentangling genuine target-related information from suppression effects that would mislead conventional GAM interpretations.
Researcher Affiliation	Academia	(1) Physikalisch-Technische Bundesanstalt, Berlin, Germany; (2) Technische Universität Berlin, Germany; (3) Charité Universitätsmedizin, Berlin, Germany
Pseudocode	No	The paper describes methodologies in prose within sections like '3 Methodology' but does not present them in a formally structured pseudocode or algorithm block.
Open Source Code	Yes	Anonymised code is available at https://github.com/braindatalab/pattern-gam with the GPL-3.0 license.
Open Datasets	Yes	The XAI-TRIS datasets (Clark et al., 2024b) are available to generate at https://github.com/braindatalab/xai-tris with the GPL-3.0 license, with fixed random seeds used for reproducibility. The COMPAS Recidivism data (ProPublica, 2016) are available at https://github.com/propublica/compas-analysis, and we provide the specific data file used recid.data in the anonymised GitHub repository. The MIMIC-IV dataset (Johnson et al., 2023b) is available via PhysioNet (Johnson et al., 2023a) at https://physionet.org/content/mimiciv/2.0/, where training is required on subject handling before access can be granted. We use the v2.0 version of MIMIC-IV.
Dataset Splits	Yes	The XAI-TRIS library outputs the data pre-split three-fold into training, validation, and test data, defaulting to a 90/5/5 split respectively. However, all models except the MLP do not take the validation data as input, with the NAM and EBM implementations performing a validation split internally. As such, we concatenate the training and validation data as an input to all models other than the MLP. For the MIMIC-IV dataset: A stratified group-aware split is applied using a group shuffle split, ensuring that no subject (subject_id) appears in both training and test sets. After preprocessing, the final sample size is N = 23190 with 19616 unique patients and 3214 samples of deceased patients, split into training/validation/testing data, leading to a typical class imbalance for medical tasks.
Hardware Specification	Yes	All experiments are able to be computed on personal devices, where we have used an M4 Pro MacBook Pro laptop for a lot of the prototyping and processing work involved. Due to the nature of the data and the lightweight models, no individual model takes more than a minute or two to train on such a laptop. Preprocessing the MIMIC-IV data takes slightly longer, however this is within the order of 30 to 60 minutes. For the XAI-TRIS model training line search and explanation calculation, we distribute jobs across a maximum of four NVIDIA A40 GPUs to parallelise the process of using multiple seeds and datasets, however these scripts would likely take around 6 to 18 hours if run locally and sequentially.
Software Dependencies	No	The paper mentions 'Pytorch implementation' for NAMs and 'scikit-learn' for Logistic Regression, but does not specify explicit version numbers for these software dependencies, which is required for a reproducible description.
Experiment Setup	Yes	We utilise the official PyTorch implementation of NAMs... The Adam optimiser is used with the default learning rate of 0.02082 and with a binary cross-entropy (BCE) loss function, training the model over a maximum of 100 epochs with a patience of 50 epochs. Output penalisation with the default value of 0.2078 is used to regularise smaller outputs of each subnetwork, similar to ridge regression. For Multi-Layer Perceptron (MLP): The model is trained for a maximum of 100 epochs with a batch size of 64, using the Adam optimiser with a learning rate of 1e-3 and a BCE loss function. Early stopping is implemented with a patience of 50 epochs based on the validation loss.
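
The Dataset Splits row describes a group-aware split for MIMIC-IV in which no subject_id appears in both the training and the test set. The following is a minimal standard-library sketch of that idea, not the authors' code. The function name group_shuffle_split, the toy subject IDs, and the test fraction are assumptions made for illustration. The sketch also omits the stratification by class label that the paper mentions.

```python
import random
from collections import defaultdict


def group_shuffle_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that each group falls entirely in train or test.

    Hypothetical helper for illustration; it is not stratified by label.
    """
    # Collect the sample indices that belong to each group (e.g. each patient).
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    # Shuffle the group identifiers, not the samples, with a fixed seed.
    unique_groups = sorted(by_group)
    random.Random(seed).shuffle(unique_groups)

    # Assign whole groups to the test set; keep at least one test group.
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = unique_groups[:n_test]
    train_groups = unique_groups[n_test:]

    train_idx = sorted(i for g in train_groups for i in by_group[g])
    test_idx = sorted(i for g in test_groups for i in by_group[g])
    return train_idx, test_idx


# Toy data: ten samples from six hypothetical subjects.
subject_ids = [101, 101, 102, 103, 103, 103, 104, 105, 105, 106]
train_idx, test_idx = group_shuffle_split(subject_ids, test_fraction=0.33, seed=0)

train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}
print("train subjects:", sorted(train_subjects))
print("test subjects:", sorted(test_subjects))
```

Because whole groups are shuffled rather than individual samples, the realised test fraction in terms of samples can differ from the requested fraction when subjects contribute unequal numbers of records.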