Shapley Based Residual Decomposition for Instance Analysis

Authors: Tommy Liu, Amanda S Barnard

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate with examples of testing data how previously unknown samples can be selected, and how or why they may be interesting, by means of the novel CC plot along with the contribution and composition quantities. The paper focuses on the possible applications that such a framework brings to the relatively unexplored field of instance analysis in the context of Explainable AI tasks.
Researcher Affiliation | Academia | Tommy Liu and Amanda Barnard, School of Computing, Australian National University, Canberra, Australia. Correspondence to: Tommy Liu <tommy.liu@anu.edu.au>.
Pseudocode | No | The paper describes algorithms such as the (truncated) permutation-sampling Monte Carlo algorithm and Kernel SHAP, and provides formulas (Equations 2, 3, and 4), but it does not include any clearly labeled pseudocode or algorithm blocks. (A hedged sketch of a generic permutation-sampling Shapley estimator is given after the table.)
Open Source Code | Yes | Code for this project can be found at github.com/uilymmot/residual-decomposition.
Open Datasets | Yes | We provide additional CC plots on several machine learning data-sets in Appendix A, where the differences between the Ridge and Random Forest regressors are even more pronounced. Furthermore, these instances tend to have a high contribution value, meaning that on average they tend to drive the average residual value upward. It is also the case that for regression-based models, instances that lie further away from others (i.e. high-leverage points) tend to have a greater effect on the model (Cook & Weisberg, 1982). We also observe individual instances that lie far away from the main group despite having similar Y-values; it may be of interest to analyze these samples further. (A sketch of flagging high-leverage points is given after the table.)
Dataset Splits | No | The paper mentions a "training set" and a "testing set" but does not provide the specific details of the dataset splits (e.g., percentages, sample counts, or explicit cross-validation folds) needed to reproduce the experiment's data partitioning. (A minimal example of a fully specified split is given after the table.)
Hardware Specification | Yes | The total run-time for the Monte Carlo permutation sampling with the Random Forest (n=100) in the symmetric case was 14 hours on an i7-10700 @ 2.9GHz, while the Ridge Regression took approximately 6 minutes, illustrating the significance of model complexity on the run-time of Shapley-based approaches over instances.
Software Dependencies | No | The paper mentions various software components and models, such as Ridge Regression, Random Forest Regressor, Isolation Forest, Kernel SHAP, PyTorch, TensorFlow, scikit-learn, and the SHAP package. However, it does not provide specific version numbers for any of these software dependencies. (A sketch for recording dependency versions is given after the table.)
Experiment Setup | Yes | The total run-time for the Monte Carlo permutation sampling with the Random Forest (n=100) in the symmetric case was 14 hours on an i7-10700 @ 2.9GHz, while the Ridge Regression took approximately 6 minutes, illustrating the significance of model complexity on the run-time of Shapley-based approaches over instances.
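
The Pseudocode row notes that the paper describes a (truncated) permutation-sampling Monte Carlo algorithm without giving pseudocode. Below is a minimal, hedged sketch of the standard permutation-sampling Shapley estimator; it is not the authors' implementation, and `value_fn` (a hypothetical coalition value function, e.g. a residual statistic computed over a subset of instances) and the parameter defaults are illustrative assumptions.

```python
# Minimal sketch of the standard Monte Carlo permutation-sampling Shapley
# estimator; NOT the authors' exact code. `value_fn` maps a list of players
# (here: data instances) to a scalar coalition value.
import random

def shapley_permutation_sampling(players, value_fn, n_permutations=1000, seed=0):
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_permutations):
        order = players[:]
        rng.shuffle(order)                    # one random ordering of players
        coalition = []
        prev_value = value_fn(coalition)      # value of the empty coalition
        for p in order:
            coalition.append(p)
            cur_value = value_fn(coalition)   # value after adding player p
            phi[p] += cur_value - prev_value  # marginal contribution of p
            prev_value = cur_value
    return {p: v / n_permutations for p, v in phi.items()}
```

Because `value_fn` is re-evaluated once per player per permutation, an expensive value function (e.g. one that retrains a Random Forest) dominates the cost, which is consistent with the 14-hour vs. 6-minute run-times quoted in the Hardware Specification row.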
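The Open Datasets row cites Cook & Weisberg (1982) on high-leverage points. As a companion illustration, here is a minimal sketch of flagging high-leverage instances via the hat-matrix diagonal of a linear model; the synthetic data and the 2p/n threshold are standard textbook choices, not anything taken from the paper.

```python
# Minimal sketch: leverage (hat-matrix diagonal) for a linear model.
# Synthetic data; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[0] += 6.0                                   # plant one far-away instance

X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # add intercept column
A = np.linalg.pinv(X1.T @ X1)
H_diag = np.einsum("ij,jk,ik->i", X1, A, X1)   # H_ii = x_i^T (X^T X)^-1 x_i

# Common rule of thumb: flag instances with leverage > 2p/n.
n, p = X1.shape
flagged = np.where(H_diag > 2 * p / n)[0]
print("high-leverage instances:", flagged)
```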
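The Dataset Splits row flags that split details are missing. For contrast, this is what a fully specified, reproducible split looks like with scikit-learn (which the paper mentions); the dataset, the 80/20 ratio, and the random seed are illustrative assumptions, not the authors' values.

```python
# Minimal sketch of a fully specified, reproducible split. The 80/20 ratio
# and random_state=42 are illustrative, NOT values reported by the authors.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"train={len(X_train)} test={len(X_test)}")  # explicit sample counts
```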
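The Software Dependencies row flags missing version numbers. A minimal way to record them at run time is sketched below; the package list is drawn from the libraries named in that row and can be extended.

```python
# Minimal sketch: record the exact versions of libraries named in the paper
# so a run can be reproduced. Only prints what is installed locally.
import importlib

for pkg in ["sklearn", "shap", "numpy"]:
    try:
        mod = importlib.import_module(pkg)
        print(pkg, getattr(mod, "__version__", "unknown"))
    except ImportError:
        print(pkg, "not installed")
```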