2D-OOB: Attributing Data Contribution Through Joint Valuation Framework

Authors: Yifan Sun, Jingyan Shen, Yongchan Kwon

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our comprehensive experiments demonstrate that 2D-OOB achieves state-of-the-art performance across multiple use cases while being exponentially faster. Specifically, 2D-OOB shows promising results in detecting and rectifying fine-grained outliers at the cell level, and in localizing backdoor triggers in data poisoning attacks. In this section, we empirically show the effectiveness of 2D-OOB across multiple use cases of joint valuation: cell-level outlier detection, cell fixation, and backdoor trigger detection.
Researcher Affiliation | Academia | Yifan Sun (University of Illinois Urbana-Champaign, yifan50@illinois.edu); Jingyan Shen (Columbia University, js5544@columbia.edu); Yongchan Kwon (Columbia University, yk3012@columbia.edu)
Pseudocode | No | The paper describes its algorithms and formulations mathematically but does not include structured pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | The code repository can be found at https://github.com/yifansun99/2D-OOB-Joint-Valuation.
Open Datasets | Yes | We use 12 publicly accessible binary classification datasets from OpenML, encompassing a range of both low- and high-dimensional datasets that have been widely used in the literature [13, 25, 26]. Details on these datasets are presented in Appendix A.1. We also select 5 pairs of classes from CIFAR-10 [24].
Dataset Splits | Yes | For each dataset, 1000 and 3000 data points are randomly sampled for the training and test datasets, respectively. For methods that require a validation dataset, such as KNNShapley and Data Shapley, we additionally sample a separate validation dataset (disjoint from the training and test datasets) to evaluate the utility function. The size of the validation dataset is set to 10% of the training sample size. (A minimal sketch of this sampling protocol appears after the table.)
Hardware Specification | Yes | The run time is measured on a single Intel Xeon Gold 6226 2.9 GHz CPU.
Software Dependencies | No | The paper mentions using scikit-learn for the subset random forest model, but does not provide specific version numbers for this or other software components.
Experiment Setup | Yes | Throughout all of our experiments, 2D-OOB uses a subset bagging model with B = 1000 decision trees. We randomly select a fixed ratio of features to build each decision tree. Unless otherwise specified, we utilize half of the features for each weak learner and set $T(y_i, \hat{f}(x_i, S_b)) = \mathbb{1}(y_i = \hat{f}(x_i, S_b))$. For the baseline method, we consider 2D-KNN, a fast and performant variant of 2D-Shapley [32]. We incorporate a distance regularization term in the utility function $T$ for enhanced performance. For 2D-KNN, we set the number of nearest neighbors to 10 and the number of permutations to 1000. (A sketch of the subset bagging setup appears after the table.)
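
The sampling protocol quoted in the Dataset Splits row is simple enough to illustrate. Below is a minimal sketch assuming a single random permutation per dataset; the OpenML dataset name "pol", the seed, and the variable names are illustrative placeholders, not taken from the paper.

```python
import numpy as np
from sklearn.datasets import fetch_openml

# "pol" is an arbitrary OpenML example, not necessarily one of the paper's 12 datasets.
X, y = fetch_openml("pol", version=1, return_X_y=True, as_frame=False)

rng = np.random.default_rng(42)        # seed chosen for illustration only
idx = rng.permutation(len(X))

train_idx = idx[:1000]                 # 1000 training points
test_idx = idx[1000:4000]              # 3000 test points
val_idx = idx[4000:4100]               # disjoint validation set, 10% of the training size

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
X_val, y_val = X[val_idx], y[val_idx]  # only needed by KNNShapley / Data Shapley
```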
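The Experiment Setup row describes a subset bagging ensemble whose out-of-bag (OOB) predictions drive the joint valuation. The sketch below is one plausible reading, assuming the score for cell $(i, j)$ averages the OOB correctness utility $T$ over the trees whose random feature subset contains feature $j$; the function name and aggregation are this sketch's assumptions, so consult the paper and the linked repository for the exact estimator.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def two_d_oob_scores(X, y, B=1000, feature_ratio=0.5, seed=0):
    """Illustrative cell-level scores from a subset bagging ensemble."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(d * feature_ratio))       # half of the features per weak learner
    hits = np.zeros((n, d))                  # running sum of T(y_i, f_hat(x_i, S_b))
    counts = np.zeros((n, d))                # number of OOB evaluations per cell
    for _ in range(B):
        S = rng.choice(d, size=k, replace=False)   # random feature subset S_b
        bag = rng.integers(0, n, size=n)           # bootstrap sample of row indices
        oob = np.setdiff1d(np.arange(n), bag)      # out-of-bag points for this tree
        if oob.size == 0:
            continue
        tree = DecisionTreeClassifier(random_state=0).fit(X[bag][:, S], y[bag])
        # T(y_i, f_hat(x_i, S_b)) = 1(y_i = f_hat(x_i, S_b)), evaluated on OOB points
        correct = (tree.predict(X[oob][:, S]) == y[oob]).astype(float)
        hits[np.ix_(oob, S)] += correct[:, None]
        counts[np.ix_(oob, S)] += 1.0
    return hits / np.clip(counts, 1.0, None)       # average OOB utility per cell
```

Under this reading, low-scoring cells flag (data point, feature) pairs that hurt OOB accuracy, which matches the cell-level outlier detection and backdoor trigger localization use cases quoted above.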