Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SHAP-IQ: Unified Approximation of any-order Shapley Interactions

Authors: Fabian Fumagalli, Maximilian Muschalik, Patrick Kolpaczki, Eyke Hüllermeier, Barbara Hammer

NeurIPS 2023

Reproducibility Variable Result LLM Response
Research Type Experimental We illustrate the computational efficiency and effectiveness by explaining language, image classification and high-dimensional synthetic models. We use SHAP-IQ to compute any-order n-Shapley Values on different ML models and demonstrate that it outperforms existing baseline methods. We conduct multiple experiments to illustrate the approximation quality of SHAP-IQ compared to current baseline approaches.
Researcher Affiliation Academia Fabian Fumagalli (Bielefeld University, CITEC, D-33619, Bielefeld, Germany); Maximilian Muschalik (LMU Munich, MCML Munich, D-80539, Munich, Germany); Patrick Kolpaczki (Paderborn University, D-33098, Paderborn, Germany); Eyke Hüllermeier (LMU Munich, MCML Munich, D-80539, Munich, Germany); Barbara Hammer (Bielefeld University, CITEC, D-33619, Bielefeld, Germany)
Pseudocode Yes Algorithm 1: SHAP-IQ for any-order interactions $S_{s_0}$ up to order $s_0$; Algorithm 2: Determine the sampling order $k_0$ for the deterministic part; Algorithm 3: Sample a subset $T \sim p(T)$; Algorithm 4: Welford Algorithm for Mean and Variance [45].
Open Source Code Yes The shapiq package extends on the well-known shap library and can be found at https://pypi.org/project/shapiq/.
Open Datasets Yes For a language model (LM), we use a fine-tuned version of the DistilBERT transformer architecture [34] on movie review sentences from the original IMDB dataset [27, 22] for sentiment analysis... For an image classification model (ICM), we use ResNet18 [19] pre-trained on ImageNet [11] as provided by torch [30].
Dataset Splits No The paper describes using pre-trained models (DistilBERT fine-tuned on IMDB, ResNet18 pre-trained on ImageNet) and randomly sampling instances for explanation. It does not provide specific train/validation/test splits for its own experimental setup.
Hardware Specification Yes The experiments concerning the approximation quality of SHAP-IQ compared to the baselines were run on a computation cluster on hyperthreaded Intel Xeon E5-2697 v3 CPUs clocking at 2.6GHz. Before running the experiments on the cluster, the implementations were validated on a Dell XPS 15 9510 containing an Intel i7-11800H at 2.30GHz.
Software Dependencies No No specific version numbers for key software dependencies like PyTorch, transformers, or scikit-image are explicitly provided in the text, only their names and citations.
Experiment Setup Yes We randomly sample 50 reviews of length d = 14 and explain each model prediction. To obtain the prediction of different coalitions, we pre-compute super-pixels with SLIC [1, 44] to obtain a function on d = 14 features and apply mean imputation on absent features. For a high-dimensional synthetic model with d = 30, we use a sum of unanimity model (SOUM) $\nu(T) := \sum_{n=1}^{N} a_n \mathbf{1}(Q_n \subseteq T)$, where N = 50 interaction subsets $Q_1, \dots, Q_N \subseteq D$ are chosen uniformly from all subset sizes and $a_1, \dots, a_N \in \mathbb{R}$ are generated uniformly $a_n \sim \mathrm{unif}([0, 1])$. For each iteration we evaluate the approximation method with different budgets up to a maximum budget of 2^14 model evaluations.
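
The sum of unanimity model quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use the shapiq package; the names `make_soum`, `n_terms`, and `seed` are hypothetical. It also assumes that "chosen uniformly from all subset sizes" means drawing a size uniformly from 1 to d and then a uniformly random subset of that size, which the quoted text does not spell out.

```python
import random


def make_soum(d=30, n_terms=50, seed=0):
    """Build a sum of unanimity model: nu(T) = sum_n a_n * 1(Q_n subset of T)."""
    rng = random.Random(seed)
    players = list(range(d))
    terms = []
    for _ in range(n_terms):
        # Assumption: subset size is uniform over 1..d, then the subset
        # itself is a uniform random draw of that size.
        size = rng.randint(1, d)
        subset = frozenset(rng.sample(players, size))
        # Coefficient a_n drawn uniformly from [0, 1].
        coeff = rng.uniform(0.0, 1.0)
        terms.append((coeff, subset))

    def value(coalition):
        """Evaluate nu(T) for a coalition T given as an iterable of player indices."""
        coalition = frozenset(coalition)
        # A term contributes a_n only if its whole subset Q_n is present in T.
        return sum(a for a, q in terms if q <= coalition)

    return value, terms


value, terms = make_soum(d=30, n_terms=50, seed=0)
empty_value = value(())          # no Q_n fits inside the empty coalition
grand_value = value(range(30))   # every Q_n is contained in the full player set
```

Under these assumptions the empty coalition evaluates to zero and the full player set evaluates to the sum of all coefficients, which gives two quick sanity checks before the game is handed to an approximation method.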