Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value

Authors: Yongchan Kwon, James Zou

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications. |
| Researcher Affiliation | Collaboration | Columbia University; Stanford University; Amazon AWS |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our Python-based implementation codes are publicly available at https://github.com/ykwon0407/dataoob. |
| Open Datasets | Yes | The covertype dataset is downloaded via the Python package scikit-learn (Pedregosa et al., 2011), and every other dataset is downloaded from OpenML (Feurer et al., 2021). Table 2 shows a summary of classification datasets. (A data-loading sketch follows the table.) |
| Dataset Splits | Yes | We split it into the three datasets, namely, a training dataset, a validation dataset, and a test dataset. ... The training dataset size n is either 1000 or 10000, and the validation size is fixed to 10% of the training sample size. The test dataset size is fixed to 3000. (A splitting sketch follows the table.) |
| Hardware Specification | Yes | We measure the elapsed time with a single Intel Xeon E5-2640v4 CPU processor. |
| Software Dependencies | No | The paper mentions scikit-learn but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | For KNN Shapley, the only hyperparameter is the number of nearest neighbors. Since there is no optimal fixed number for this hyperparameter, we set it to 10% of the sample size n, motivated by Jia et al. (2019a). ... For AME, we set the number of utility evaluations to be 800. ... The proposed method fits a random forest model with B = 800 decision trees using scikit-learn. (A Data-OOB sketch follows the table.) |
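
For concreteness, here is a hedged sketch of the data-loading route the Open Datasets row describes: covertype through scikit-learn and the remaining benchmarks through OpenML. The dataset name "pol" below is used only as an illustration; Table 2 of the paper lists the actual datasets.

```python
from sklearn.datasets import fetch_covtype, fetch_openml

# Covertype comes bundled with scikit-learn's dataset loaders.
X_cov, y_cov = fetch_covtype(return_X_y=True)

# Every other dataset is pulled from OpenML; "pol" is an assumed example
# name -- see Table 2 of the paper for the actual dataset list.
X_pol, y_pol = fetch_openml("pol", version=1, return_X_y=True, as_frame=False)
```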
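The split sizes in the Dataset Splits row translate directly into index arithmetic. The helper below is a minimal sketch under that reading; the function name, seed, and use of a random permutation are assumptions, not the authors' code.

```python
import numpy as np

def split_indices(num_total, n_train=1000, n_test=3000, seed=0):
    """Return disjoint train/validation/test index arrays.

    Sizes follow the paper's description: n training points (1000 or
    10000), a validation set of 0.1 * n, and 3000 test points.
    """
    n_val = int(0.1 * n_train)  # validation size is 10% of the training size
    assert n_train + n_val + n_test <= num_total, "not enough samples"
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_total)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(num_total=20000)
```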
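Finally, the Data-OOB value itself is simple enough to sketch from the Experiment Setup row: fit B = 800 bootstrap decision trees and score each training point by the average accuracy of the trees for which that point was out-of-bag. The function below is a minimal re-implementation sketch under that reading, not the authors' released code (which builds on scikit-learn's random forest; see the repository linked above).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def data_oob_values(X, y, n_trees=800, seed=0):
    """Sketch of Data-OOB: per-point average out-of-bag correctness.

    X, y are NumPy arrays; n_trees mirrors the paper's B = 800 setting.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    correct_sum = np.zeros(n)  # sum of OOB correctness indicators per point
    oob_count = np.zeros(n)    # number of trees for which each point is OOB
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)       # bootstrap sample, with replacement
        oob = np.setdiff1d(np.arange(n), boot)  # indices never drawn -> out-of-bag
        tree = DecisionTreeClassifier(random_state=int(rng.integers(2**31)))
        tree.fit(X[boot], y[boot])
        correct_sum[oob] += (tree.predict(X[oob]) == y[oob])
        oob_count[oob] += 1
    # Data value = average OOB correctness; points never OOB get value 0.
    return np.divide(correct_sum, oob_count, out=np.zeros(n), where=oob_count > 0)
```

Low values flag points that the ensemble consistently misclassifies out-of-bag, which is how the method surfaces mislabeled data in the paper's experiments.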