Explaining Probabilistic Models with Distributional Values

Authors: Luca Franceschi, Michele Donini, Cedric Archambeau, Matthias Seeger

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we showcase applications to image classifiers and autoregressive language models (Section 5). We train a random forest binary classifier f on the Adult income dataset (Appendix D.5).
Researcher Affiliation | Industry | 1Amazon Web Services, Berlin, Germany 2Helsing, Berlin, Germany. Correspondence to: Luca Franceschi <franuluc@amazon.de>.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Python code is available at https://github.com/amazon-science/explaining-probabilistic-models-with-distributinal-values.
Open Datasets | Yes | Iris dataset, MNIST (LeCun et al., 1998), ImageNet (Deng et al., 2009), Adult income dataset
Dataset Splits | No | For concreteness, we take as running examples the tasks of explaining the output of a logistic multiclass classifier f(x) = Softmax(x W + b) trained on the Iris dataset and the XOR game of Example 3.5. Test images from MNIST (LeCun et al., 1998) and ImageNet (Deng et al., 2009). We train a random forest binary classifier f on the Adult income dataset and compute the Bernoulli Shapley value (BSV) for one misclassified test instance. The paper uses standard datasets but does not specify the exact training/validation/test splits used for reproducibility.
Hardware Specification | Yes | We run all the experiments on a machine with 8 Intel(R) Xeon(R) Platinum 8259CL CPUs @ 2.50GHz and one Nvidia(R) Tesla(R) T4 GPU.
Software Dependencies | No | Python code is available at https://github.com/amazon-science/explaining-probabilistic-models-with-distributinal-values. The paper does not specify versions for Python libraries or other software dependencies.
Experiment Setup | Yes | To compute both the standard and Categorical SV, we use a simple permutation-based 1000-sample Monte Carlo estimator (Strumbelj & Kononenko, 2010). For out-of-coalition pixels, we use a reference value of 0. We compute average categorical differences between output given prompts with female versus male subject. We restrict the output to a number of tokens in the order of 100 (depending on the sentence), picking a mix of manually selected, most probable (for a GPT2 model) and ChatGPT-generated short continuations.
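The permutation-based Monte Carlo Shapley estimator mentioned in the experiment setup can be sketched as below. This is an illustrative reconstruction in the spirit of Strumbelj & Kononenko (2010), not the authors' released implementation; the `value_fn` game interface and the two-player XOR example (cf. the paper's Example 3.5) are assumptions made here for demonstration.

```python
import random

def mc_shapley(value_fn, n_features, n_samples=1000, seed=0):
    """Estimate Shapley values by sampling random feature permutations.

    value_fn: maps a frozenset of in-coalition feature indices to a scalar
              payoff (e.g., a model output with out-of-coalition features
              set to a reference value).
    Returns a list of estimated Shapley values, one per feature.
    """
    rng = random.Random(seed)
    phi = [0.0] * n_features
    features = list(range(n_features))
    for _ in range(n_samples):
        perm = features[:]
        rng.shuffle(perm)          # one random ordering of the players
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for j in perm:             # add players one at a time
            coalition.add(j)
            cur = value_fn(frozenset(coalition))
            phi[j] += cur - prev   # marginal contribution of feature j
            prev = cur
    return [p / n_samples for p in phi]

# XOR-style game: payoff 1 iff exactly one player is in the coalition.
# By symmetry both Shapley values are 0; the estimates hover around 0.
xor_game = lambda S: float(len(S) == 1)
estimates = mc_shapley(xor_game, n_features=2, n_samples=500)
```

By construction the per-permutation marginal contributions telescope, so the estimates always satisfy the efficiency property exactly: they sum to `value_fn(full set) - value_fn(empty set)` regardless of the number of samples.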