Data Debugging with Shapley Importance over Machine Learning Pipelines
Authors: Bojan Karlaš, David Dao, Matteo Interlandi, Sebastian Schelter, Wentao Wu, Ce Zhang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, our experimental evaluation demonstrates that our methods are capable of data error discovery that is as effective as existing Monte Carlo baselines, and in some cases even outperform them. We release our code as an open-source data debugging library available at github.com/easeml/datascope. |
| Researcher Affiliation | Collaboration | Bojan Karlaš1*, David Dao2, Matteo Interlandi3, Sebastian Schelter4, Wentao Wu3, Ce Zhang5 1Harvard University, 2ETH Zurich, 3Microsoft, 4University of Amsterdam, 5University of Chicago |
| Pseudocode | Yes | Algorithm 1 Compiling a provenance-tracked dataset into ADD. |
| Open Source Code | Yes | We release our code as an open-source data debugging library available at github.com/easeml/datascope. ... Our code is available at github.com/easeml/datascope. |
| Open Datasets | Yes | Datasets. We assemble a collection of widely used datasets with diverse modalities (i.e. tabular, textual, and image datasets). Table 2 summarizes the datasets that we used. ... UCI Adult (Kohavi et al., 1996), Folktables Adult (Ding et al., 2021), Fashion MNIST (Xiao et al., 2017), 20 Newsgroups (Joachims, 1996), DataPerf Vision (Mazumder et al., 2022), CIFAR-N (Wei et al., 2022). |
| Dataset Splits | Yes | We compute the importance using a validation dataset and use it to prioritize our label repairs. ... Dataset: CIFAR-N; Pipeline: Histogram of Oriented Gradients; Target Model: Logistic Regression; Validation / Test Set Size: 5K |
| Hardware Specification | Yes | All experiments were conducted on an AMD EPYC 7742 2.25GHz CPU. We ran each experiment in single-thread mode. All deep learning models were running on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions software like 'dcbench', 'sklearn pipelines', and 'Logistic Regression and KNeighborsClassifier provided by the sklearn package'. However, it does not provide specific version numbers for these software dependencies (e.g., 'sklearn 0.24' or 'dcbench 1.0'). |
| Experiment Setup | Yes | If a dataset does not already have human-generated label errors, we follow the protocol of Li et al. (2021) and Jia et al. (2021) and artificially inject 50% of label noise. ... We set max_iter to 5,000 for Logistic Regression and set n_neighbors to 1 for K-Nearest Neighbor. ... We fine-tune it for 5 epochs on a noisy label dataset and see that Datascope KNN fares favorably compared to random label repair. (See the setup sketch below the table.) |
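
The Experiment Setup and Dataset Splits rows can be approximated in scikit-learn. The sketch below is a minimal reconstruction under stated assumptions: a uniform random label-flip helper for the 50% noise injection (the paper follows the protocol of Li et al., 2021 and Jia et al., 2021, which is not quoted in full here), `skimage.feature.hog` as the Histogram of Oriented Gradients extractor, and hypothetical helper names. The Shapley importance computation itself comes from the authors' datascope library and is not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from skimage.feature import hog  # assumed HOG implementation


def inject_label_noise(y, noise_fraction=0.5, n_classes=10, seed=0):
    """Flip a fraction of labels uniformly at random (assumed protocol)."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_fraction * len(y)), replace=False)
    # Shift each selected label to a different random class.
    y_noisy[idx] = (y_noisy[idx] + rng.integers(1, n_classes, size=len(idx))) % n_classes
    return y_noisy, idx


def hog_features(images):
    """One HOG feature vector per (H, W, 3) image."""
    return np.stack([hog(img, channel_axis=-1) for img in images])


# Target models with the hyperparameters quoted above.
logreg = LogisticRegression(max_iter=5000)
knn = KNeighborsClassifier(n_neighbors=1)

# Pipeline matching the quoted CIFAR-N setup: HOG features -> Logistic Regression.
pipeline = Pipeline([
    ("hog", FunctionTransformer(hog_features)),
    ("scale", StandardScaler()),
    ("model", logreg),
])

# Importance scores (computed with the datascope library on a held-out
# validation set, not shown here) are then used to decide which noisy
# training labels to repair first.
```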
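
The abstract quoted under Research Type positions the method against existing Monte Carlo baselines for data importance. For orientation, a generic permutation-sampling data Shapley estimator of that kind is sketched below; this is a textbook formulation (utility = validation accuracy, no truncation), not the authors' algorithm, and the function name is hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier


def monte_carlo_shapley(X_train, y_train, X_val, y_val, model,
                        n_permutations=100, seed=0):
    """Permutation-sampling estimate of each training point's Shapley value.

    For every sampled permutation, points are added one at a time and each
    point is credited with the change in validation accuracy it causes.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    values = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev_score = 0.0  # utility of the empty training set (assumed 0)
        for k in range(1, n + 1):
            subset = perm[:k]
            try:
                fitted = clone(model).fit(X_train[subset], y_train[subset])
                score = fitted.score(X_val, y_val)
            except ValueError:
                # Some models cannot fit tiny or single-class subsets;
                # keep the previous utility in that case.
                score = prev_score
            values[perm[k - 1]] += score - prev_score
            prev_score = score
    return values / n_permutations


# Usage sketch with the 1-NN target model quoted in the table; points with the
# lowest estimated values are the most likely label errors to repair first.
# values = monte_carlo_shapley(X_tr, y_tr_noisy, X_val, y_val,
#                              KNeighborsClassifier(n_neighbors=1))
# repair_order = np.argsort(values)
```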