Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value
Authors: Yongchan Kwon, James Zou
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications. |
| Researcher Affiliation | Collaboration | ¹Columbia University, ²Stanford University, ³Amazon AWS. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our Python-based implementation codes are publicly available at https://github.com/ykwon0407/dataoob. |
| Open Datasets | Yes | The covertype dataset is downloaded via the Python package scikit-learn (Pedregosa et al., 2011), and every other dataset is downloaded from OpenML (Feurer et al., 2021). Table 2 shows a summary of classification datasets. |
| Dataset Splits | Yes | We split it into three datasets, namely, a training dataset, a validation dataset, and a test dataset. ... The training dataset size n is either 1000 or 10000, and the validation size is fixed to 10% of the training sample size. The test dataset size is fixed to 3000. |
| Hardware Specification | Yes | We measure the elapsed time with a single Intel Xeon E5-2640v4 CPU processor. |
| Software Dependencies | No | The paper mentions 'scikit-learn' but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | For KNN Shapley, the only hyperparameter is the number of nearest neighbors. Since there is no optimal fixed number for hyperparameter, we set it to be 10% of the sample size n motivated by Jia et al. (2019a). ... For AME, we set the number of utility evaluations to be 800. ... The proposed method fits a random forest model with B = 800 decision trees using scikit-learn. |
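The setup row above describes fitting a random forest with B = 800 decision trees; Data-OOB then values each training point by how often the trees that did *not* see it classify it correctly. The authors' released code is at the repository linked above; the following is only a minimal illustrative sketch of that out-of-bag idea, using an explicit bagging loop over scikit-learn decision trees (all names and parameters here are our own, not the paper's).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def data_oob_values(X, y, n_trees=800, seed=0):
    """Sketch of the Data-OOB idea: a point's value is the fraction of
    trees that classify it correctly, among trees where it was out-of-bag."""
    rng = np.random.default_rng(seed)
    n = len(X)
    correct = np.zeros(n)  # per-point sum of out-of-bag correctness
    counts = np.zeros(n)   # number of trees for which the point was out-of-bag
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)  # points not drawn into this tree
        tree = DecisionTreeClassifier(random_state=0).fit(X[boot], y[boot])
        correct[oob] += tree.predict(X[oob]) == y[oob]
        counts[oob] += 1
    return correct / np.maximum(counts, 1)  # guard against never-OOB points

X, y = make_classification(n_samples=200, random_state=0)
values = data_oob_values(X, y, n_trees=50)
```

With this scoring, mislabeled points tend to receive low values (their out-of-bag predictions disagree with the given label), which is how the paper's mislabeled-data detection experiments use the scores.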