Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

Authors: Yongchan Kwon, James Zou

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications. |
| Researcher Affiliation | Collaboration | Columbia University; Stanford University; Amazon AWS |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our Python-based implementation codes are publicly available at https://github.com/ykwon0407/dataoob. |
| Open Datasets | Yes | The covertype dataset is downloaded via the Python package scikit-learn (Pedregosa et al., 2011), and every other dataset is downloaded from OpenML (Feurer et al., 2021). Table 2 shows a summary of classification datasets. (A data-loading sketch follows the table.) |
| Dataset Splits | Yes | We split it into the three datasets, namely, a training dataset, a validation dataset, and a test dataset. ... The training dataset size n is either 1000 or 10000, and the validation size is fixed to 10% of the training sample size. The test dataset size is fixed to 3000. (A splitting sketch follows the table.) |
| Hardware Specification | Yes | We measure the elapsed time with a single Intel Xeon E5-2640v4 CPU processor. |
| Software Dependencies | No | The paper mentions scikit-learn but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | For KNN Shapley, the only hyperparameter is the number of nearest neighbors. Since there is no optimal fixed number for this hyperparameter, we set it to 10% of the sample size n, motivated by Jia et al. (2019a). ... For AME, we set the number of utility evaluations to be 800. ... The proposed method fits a random forest model with B = 800 decision trees using scikit-learn. (A Data-OOB sketch follows the table.) |
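
For concreteness, here is a hedged sketch of the data-loading route the Open Datasets row describes: covertype through scikit-learn and the remaining benchmarks through OpenML. The dataset name "pol" below is used only as an illustration; Table 2 of the paper lists the actual datasets.

```python
from sklearn.datasets import fetch_covtype, fetch_openml

# Covertype comes bundled with scikit-learn's dataset loaders.
X_cov, y_cov = fetch_covtype(return_X_y=True)

# Every other dataset is pulled from OpenML; "pol" is an assumed example
# name -- see Table 2 of the paper for the actual dataset list.
X_pol, y_pol = fetch_openml("pol", version=1, return_X_y=True, as_frame=False)
```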
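The split sizes in the Dataset Splits row translate directly into index arithmetic. The helper below is a minimal sketch under that reading; the function name, seed, and use of a random permutation are assumptions, not the authors' code.

```python
import numpy as np

def split_indices(num_total, n_train=1000, n_test=3000, seed=0):
    """Return disjoint train/validation/test index arrays.

    Sizes follow the paper's description: n training points (1000 or
    10000), a validation set of 0.1 * n, and 3000 test points.
    """
    n_val = int(0.1 * n_train)  # validation size is 10% of the training size
    assert n_train + n_val + n_test <= num_total, "not enough samples"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_total)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(num_total=20000)
```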
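Finally, the Data-OOB value itself is simple enough to sketch from the Experiment Setup row: fit B = 800 bootstrap decision trees and score each training point by the average accuracy of the trees for which that point was out-of-bag. The function below is a minimal re-implementation sketch under that reading, not the authors' released code (which builds on scikit-learn's random forest; see the repository linked above).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def data_oob_values(X, y, n_trees=800, seed=0):
    """Sketch of Data-OOB: per-point average out-of-bag correctness.

    X, y are NumPy arrays; n_trees mirrors the paper's B = 800 setting.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    correct_sum = np.zeros(n)  # sum of OOB correctness indicators per point
    oob_count = np.zeros(n)    # number of trees for which each point is OOB
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)       # bootstrap sample, with replacement
        oob = np.setdiff1d(np.arange(n), boot)  # indices never drawn -> out-of-bag
        tree = DecisionTreeClassifier(random_state=int(rng.integers(2**31)))
        tree.fit(X[boot], y[boot])
        correct_sum[oob] += (tree.predict(X[oob]) == y[oob])
        oob_count[oob] += 1
    # Data value = average OOB correctness; points never OOB get value 0.
    return np.divide(correct_sum, oob_count, out=np.zeros(n), where=oob_count > 0)
```

Low values flag points that the ensemble consistently misclassifies out-of-bag, which is how the method surfaces mislabeled data in the paper's experiments.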