Data Debugging with Shapley Importance over Machine Learning Pipelines

Authors: Bojan Karlaš, David Dao, Matteo Interlandi, Sebastian Schelter, Wentao Wu, Ce Zhang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them.
Researcher Affiliation | Collaboration | Bojan Karlaš¹*, David Dao², Matteo Interlandi³, Sebastian Schelter⁴, Wentao Wu³, Ce Zhang⁵; ¹Harvard University, ²ETH Zurich, ³Microsoft, ⁴University of Amsterdam, ⁵University of Chicago
Pseudocode | Yes | Algorithm 1: Compiling a provenance-tracked dataset into an ADD.
Open Source Code | Yes | We release our code as an open-source data debugging library available at github.com/easeml/datascope.
Open Datasets | Yes | Datasets. We assemble a collection of widely used datasets with diverse modalities (i.e., tabular, textual, and image datasets). Table 2 summarizes the datasets that we used. ... UCI Adult (Kohavi et al., 1996), Folktables Adult (Ding et al., 2021), Fashion-MNIST (Xiao et al., 2017), 20 Newsgroups (Joachims, 1996), DataPerf Vision (Mazumder et al., 2022), CIFAR-N (Wei et al., 2022).
Dataset Splits | Yes | We compute the importance using a validation dataset and use it to prioritize our label repairs. ... Dataset: CIFAR-N; Pipeline: Histogram of Oriented Gradients; Target Model: Logistic Regression; Validation/Test Set Size: 5K. (A hedged sketch of this importance-guided repair loop appears after the table.)
Hardware Specification | Yes | All experiments were conducted on an AMD EPYC 7742 2.25GHz CPU. We ran each experiment in single-thread mode. All deep learning models were run on an NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions software such as 'dcbench', 'sklearn pipelines', and 'Logistic Regression and KNeighborsClassifier provided by the sklearn package'. However, it does not provide version numbers for these dependencies (e.g., 'sklearn 0.24' or 'dcbench 1.0').
Experiment Setup | Yes | If a dataset does not already have human-generated label errors, we follow the protocol of Li et al. (2021) and Jia et al. (2021) and artificially inject 50% label noise. ... We set max_iter to 5,000 for Logistic Regression and n_neighbors to 1 for K-Nearest Neighbor. ... We fine-tune it for 5 epochs on a noisy-label dataset and see that Datascope KNN fares favorably compared to random label repair. (A sketch of this setup appears after the table.)
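
To make the Dataset Splits row concrete, here is a minimal sketch of the importance-guided label repair loop it quotes. This is an assumption-laden illustration, not the datascope library's API: the scorer is the closed-form KNN Shapley of Jia et al. (2019), which the paper's Datascope KNN method builds on, and the function names (knn_shapley, repair_lowest_first) and the budget parameter are hypothetical.

```python
# Hedged sketch: importance-guided label repair scored on validation data.
# The scorer is the closed-form KNN Shapley of Jia et al. (2019); it stands
# in for the paper's pipeline-aware Shapley computation. All names here are
# illustrative, not part of the datascope library.
import numpy as np

def knn_shapley(X_train, y_train, X_val, y_val, K=1):
    """Closed-form Shapley value of each training point for a K-NN classifier."""
    N = len(y_train)
    scores = np.zeros(N)
    for x, y in zip(X_val, y_val):
        order = np.argsort(np.linalg.norm(X_train - x, axis=1))  # nearest first
        match = (y_train[order] == y).astype(float)
        s = np.zeros(N)
        s[N - 1] = match[N - 1] / N
        for i in range(N - 2, -1, -1):  # recurrence from farthest to nearest
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
        scores[order] += s
    return scores / len(y_val)

def repair_lowest_first(y_noisy, y_true, scores, budget):
    """Repair the labels with the lowest (most harmful) importance first.
    `y_true` stands in for a human annotator supplying corrected labels."""
    repaired = y_noisy.copy()
    for idx in np.argsort(scores)[:budget]:
        repaired[idx] = y_true[idx]
    return repaired
```

Sorting in ascending order reflects that low or negative Shapley values mark training points that hurt validation accuracy, which under label noise are disproportionately the mislabeled ones.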
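
Similarly, the Experiment Setup row can be read as the following sketch: inject 50% label noise, then configure the two target models with the quoted hyperparameters. The noise-injection strategy (uniform flips to a different class) and the helper name inject_label_noise are assumptions; only the 50% rate and the model settings come from the quoted text.

```python
# Hedged reconstruction of the quoted setup. The flip strategy is an
# assumption; only the 50% noise rate and the hyperparameters below are
# taken from the paper's quoted description.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def inject_label_noise(y, noise_fraction=0.5, num_classes=10, seed=0):
    """Flip a random `noise_fraction` of integer labels to a different class."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip = rng.choice(len(y), size=int(noise_fraction * len(y)), replace=False)
    # Adding a nonzero offset modulo num_classes guarantees a changed label.
    y_noisy[flip] = (y_noisy[flip] + rng.integers(1, num_classes, len(flip))) % num_classes
    return y_noisy

# Target models with the hyperparameters quoted in the row above.
logreg = LogisticRegression(max_iter=5000)
knn = KNeighborsClassifier(n_neighbors=1)
```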