Data Debugging with Shapley Importance over Machine Learning Pipelines
Authors: Bojan Karlaš, David Dao, Matteo Interlandi, Sebastian Schelter, Wentao Wu, Ce Zhang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them. We release our code as an open-source data debugging library available at github.com/easeml/datascope. |
| Researcher Affiliation | Collaboration | Bojan Karlaš1*, David Dao2, Matteo Interlandi3, Sebastian Schelter4, Wentao Wu3, Ce Zhang5 1Harvard University, 2ETH Zurich, 3Microsoft, 4University of Amsterdam, 5University of Chicago |
| Pseudocode | Yes | Algorithm 1 Compiling a provenance-tracked dataset into ADD. |
| Open Source Code | Yes | We release our code as an open-source data debugging library available at github.com/easeml/datascope. ... Our code is available at github.com/easeml/datascope. |
| Open Datasets | Yes | Datasets. We assemble a collection of widely used datasets with diverse modalities (i.e. tabular, textual, and image datasets). Table 2 summarizes the datasets that we used. ... UCI Adult (Kohavi et al., 1996), Folktables Adult (Ding et al., 2021), Fashion MNIST (Xiao et al., 2017), 20 Newsgroups (Joachims, 1996), DataPerf Vision (Mazumder et al., 2022), CIFAR-N (Wei et al., 2022). |
| Dataset Splits | Yes | We compute the importance using a validation dataset and use it to prioritize our label repairs. ... Dataset: CIFAR-N; Pipeline: Histogram of Oriented Gradients; Target Model: Logistic Regression; Validation / Test Set Size: 5K |
| Hardware Specification | Yes | All experiments were conducted on an AMD EPYC 7742 2.25GHz CPU. We ran each experiment in single-thread mode. All deep learning models were running on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions software like 'dcbench', 'sklearn pipelines', and 'Logistic Regression and KNeighborsClassifier provided by the sklearn package'. However, it does not provide specific version numbers for these software dependencies (e.g., 'sklearn 0.24' or 'dcbench 1.0'). |
| Experiment Setup | Yes | If a dataset does not already have human-generated label errors, we follow the protocol of Li et al. (2021) and Jia et al. (2021) and artificially inject 50% of label noise. ... We set max_iter to 5,000 for Logistic Regression and set n_neighbors to 1 for K-Nearest Neighbor. ... We fine-tune it for 5 epochs on a noisy label dataset and see that Datascope KNN fares favorably compared to random label repair. (See the setup sketch below the table.) |
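
The Experiment Setup and Dataset Splits rows can be approximated in scikit-learn. The sketch below is a minimal reconstruction under stated assumptions: a uniform random label-flip helper for the 50% noise injection (the paper follows the protocol of Li et al., 2021 and Jia et al., 2021, which is not quoted in full here), `skimage.feature.hog` as the Histogram of Oriented Gradients extractor, and hypothetical helper names. The Shapley importance computation itself comes from the authors' datascope library and is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from skimage.feature import hog  # assumed HOG implementation


def inject_label_noise(y, noise_fraction=0.5, n_classes=10, seed=0):
    """Flip a fraction of labels uniformly at random (assumed protocol)."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_fraction * len(y)), replace=False)
    # Shift each selected label to a different random class.
    y_noisy[idx] = (y_noisy[idx] + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_noisy, idx


def hog_features(images):
    """One HOG feature vector per (H, W, 3) image."""
    return np.stack([hog(img, channel_axis=-1) for img in images])


# Target models with the hyperparameters quoted above.
logreg = LogisticRegression(max_iter=5000)
knn = KNeighborsClassifier(n_neighbors=1)

# Pipeline matching the quoted CIFAR-N setup: HOG features -> Logistic Regression.
pipeline = Pipeline([
    ("hog", FunctionTransformer(hog_features)),
    ("scale", StandardScaler()),
    ("model", logreg),
])

# Importance scores (computed with the datascope library on a held-out
# validation set, not shown here) are then used to decide which noisy
# training labels to repair first.
```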
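
The abstract quoted under Research Type positions the method against existing Monte Carlo baselines for data importance. For orientation, a generic permutation-sampling data Shapley estimator of that kind is sketched below; this is a textbook formulation (utility = validation accuracy, no truncation), not the authors' algorithm, and the function name is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier


def monte_carlo_shapley(X_train, y_train, X_val, y_val, model,
                        n_permutations=100, seed=0):
    """Permutation-sampling estimate of each training point's Shapley value.

    For every sampled permutation, points are added one at a time and each
    point is credited with the change in validation accuracy it causes.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev_score = 0.0  # utility of the empty training set (assumed 0)
        for k in range(1, n + 1):
            subset = perm[:k]
            try:
                fitted = clone(model).fit(X_train[subset], y_train[subset])
                score = fitted.score(X_val, y_val)
            except ValueError:
                # Some models cannot fit tiny or single-class subsets;
                # keep the previous utility in that case.
                score = prev_score
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / n_permutations


# Usage sketch with the 1-NN target model quoted in the table; points with the
# lowest estimated values are the most likely label errors to repair first.
# values = monte_carlo_shapley(X_tr, y_tr_noisy, X_val, y_val,
#                              KNeighborsClassifier(n_neighbors=1))
# repair_order = np.argsort(values)
```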