Use-Case-Grounded Simulations for Explanation Evaluation
Authors: Valerie Chen, Nari Johnson, Nicholay Topin, Gregory Plumb, Ameet Talwalkar
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We run a comprehensive evaluation on three real-world use cases (forward simulation, model debugging, and counterfactual reasoning) to demonstrate that Sim Evals can effectively identify which explanation methods will help humans for each use case. |
| Researcher Affiliation | Academia | Valerie Chen, Nari Johnson, Nicholay Topin, Gregory Plumb, Ameet Talwalkar. Carnegie Mellon University. Correspondence to valeriechen@cmu.edu. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Tutorial on how to run your own Sim Evals: https://github.com/valeriechen/simeval_tutorial |
| Open Datasets | Yes | A base dataset (D = {(x, y)}), on which the explanation method's utility will be evaluated. For example, the data generation process for the forward simulation use case from Hase and Bansal [16]: (Left) For this particular information content type, each observation contains the data point x_i and the model explanation for that data point, E(f, x_i). The three use-case-specific components are the base dataset D, base model family F, and function that defines a use case label. (Right) These components are used to generate a dataset of N observations: ((x_1, E(f, x_1)), f(x_1)), ((x_2, E(f, x_2)), f(x_2)), ..., ((x_N, E(f, x_N)), f(x_N)), with (x_i, y_i) drawn from UCI Adult. |
| Dataset Splits | Yes | Following the same procedure as Hase and Bansal, we split D into train, test, and validation sets. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions various explanation methods and models like LIME, SHAP, GAM, LightGBM, but it does not provide specific version numbers for any software dependencies needed to reproduce the experiments. |
| Experiment Setup | Yes | We include training details for our Sim Evals in Appendix F, extended results (with error bars) and comparisons with human results in Appendix G. |
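
The data generation process quoted in the Open Datasets row can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the fixed linear model standing in for f, the per-feature contribution function standing in for E(f, x), the synthetic data standing in for UCI Adult, and the names `explain` and `build_observations` are all assumptions made for the example.

```python
import numpy as np


def explain(weights, x):
    # Hypothetical stand-in for E(f, x): the per-feature contribution
    # of a linear model (weight times feature value).
    return weights * x


def build_observations(X, weights, bias):
    """Build N observations ((x_i, E(f, x_i)), f(x_i)).

    Returns `inputs`, where each row concatenates x_i with its
    explanation, and `labels`, the model predictions f(x_i) that serve
    as the use-case label for forward simulation.
    """
    labels = (X @ weights + bias > 0).astype(int)          # f(x_i)
    explanations = np.stack([explain(weights, x) for x in X])
    inputs = np.concatenate([X, explanations], axis=1)     # (x_i, E(f, x_i))
    return inputs, labels


# Synthetic base dataset (assumption: 100 points, 5 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
weights = rng.normal(size=5)

inputs, labels = build_observations(X, weights, bias=0.0)
```

In this sketch, `inputs` and `labels` would then be split into train, validation, and test sets, and a separate predictor would be trained to predict `labels` from `inputs`.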