Error Discovery By Clustering Influence Embeddings

Authors: Fulton Wang, Julius Adebayo, Sarah Tan, Diego Garcia-Olano, Narine Kokhlikyan

NeurIPS 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We show Inf Embed outperforms current state-of-the-art methods on 2 benchmarks, and is effective for model debugging across several case studies.
Researcher Affiliation | Collaboration | Fulton Wang (Meta), Julius Adebayo (Prescient Design / Genentech), Sarah Tan (Cornell University), Diego Garcia-Olano (Meta), Narine Kokhlikyan (Meta)
Pseudocode | Yes | Algorithm 2: Our SDM, Inf Embed, applies K-Means to influence embeddings of test examples.
Open Source Code | Yes | Code to replicate our findings is available at: https://github.com/adebayoj/infembed
Open Datasets | Yes | dcbench [Eyuboglu et al., 2022a] provides 1235 pre-trained models that are derived from real-world data... The Spot Check benchmark [Plumb et al., 2022]... the test split of Imagenet [Deng et al., 2009]... AGNews [Zhang et al., 2015]... bone-age classification (https://www.kaggle.com/datasets/kmader/rsna-bone-age)
Dataset Splits | No | The paper mentions using a 'training dataset' and 'test dataset' from standard benchmarks such as Imagenet and AGNews, and the bone-age dataset. While these benchmarks typically have predefined splits, the paper does not explicitly state the percentages or sample counts for training, validation, and test splits, nor does it specify how any custom validation sets were created.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud computing instance specifications used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn) required to reproduce the experiments.
Experiment Setup | Yes | For all experiments, we use Arnoldi dimension P = 500 and influence embedding dimension D = 100, unless noted otherwise. In the experiments that use Inf Embed-Rule, we used branching factor B = 3. The rationalele is that B should not be too large, to avoid unnecessarily dividing large slices with sufficiently low accuracy into smaller slices. In practice, B = 2 and B = 3 did not give qualitatively different results.
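The slice-discovery step described in the Pseudocode and Experiment Setup rows (K-Means over influence embeddings, with embedding dimension D = 100) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real influence embeddings come from the paper's Arnoldi-based low-rank approximation (dimension P = 500), which we stand in for with random data here, and the cluster count K is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hyperparameters: D = 100 matches the paper's influence embedding
# dimension; the Arnoldi dimension P = 500 governs the earlier
# low-rank approximation step, which is not shown here.
D = 100
K = 10  # number of candidate slices; illustrative, not from the paper

# Stand-in for influence embeddings of N test examples. In Inf Embed
# these are derived from an Arnoldi-based factorization involving the
# training loss Hessian, so that dot products approximate influence.
rng = np.random.default_rng(0)
test_embeddings = rng.normal(size=(2000, D))

# The slice-discovery step: apply K-Means to the test embeddings.
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
slice_labels = kmeans.fit_predict(test_embeddings)

# Each cluster is a candidate error slice; in practice one would
# inspect the clusters with the lowest model accuracy first.
for k in range(K):
    members = np.flatnonzero(slice_labels == k)
    print(f"slice {k}: {len(members)} test examples")
```

The point of clustering influence embeddings rather than raw features is that examples whose predictions are influenced by the same training data land in the same slice, which is what makes the discovered slices useful for debugging.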