2D-OOB: Attributing Data Contribution Through Joint Valuation Framework
Authors: Yifan Sun, Jingyan Shen, Yongchan Kwon
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comprehensive experiments demonstrate that 2D-OOB achieves state-of-the-art performance across multiple use cases while being exponentially faster. Specifically, 2D-OOB shows promising results in detecting and rectifying fine-grained outliers at the cell level, and localizing backdoor triggers in data poisoning attacks. In this section, we empirically show the effectiveness of 2D-OOB across multiple use cases of the joint valuation: cell-level outlier detection, cell fixation, and backdoor trigger detection. |
| Researcher Affiliation | Academia | Yifan Sun University of Illinois Urbana-Champaign yifan50@illinois.edu Jingyan Shen Columbia University js5544@columbia.edu Yongchan Kwon Columbia University yk3012@columbia.edu |
| Pseudocode | No | The paper describes algorithms and formulations mathematically but does not include structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | Code repository can be found at https://github.com/yifansun99/2D-OOB-Joint-Valuation. |
| Open Datasets | Yes | We use 12 publicly accessible binary classification datasets from OpenML, encompassing a range of both low- and high-dimensional datasets, which have been widely used in the literature [13, 25, 26]. Details on these datasets are presented in Appendix A.1. We select 5 pairs of classes from CIFAR-10 [24]. |
| Dataset Splits | Yes | For each dataset, 1000 and 3000 data points are randomly sampled for training and test datasets, respectively. Note that for methods that need a validation dataset such as KNNShapley and Data Shapley, we additionally sample a separate validation dataset (disjoint from training dataset and test dataset) to evaluate the utility function. The size of the validation dataset is set to 10% of the training sample size. |
| Hardware Specification | Yes | The run time is measured on a single Intel Xeon Gold 6226 2.9 GHz CPU. |
| Software Dependencies | No | The paper mentions using "scikit-learn" for the subset random forest model, but does not provide specific version numbers for this or other software components. |
| Experiment Setup | Yes | Throughout all of our experiments, 2D-OOB uses a subset bagging model with B = 1000 decision trees. We randomly select a fixed ratio of features to build each decision tree. Unless otherwise specified, we utilize half of the features for each weak learner and set $T(y_i, \hat{f}(x_i, S_b)) = \mathbb{1}(y_i = \hat{f}(x_i, S_b))$. For the baseline method, we consider 2D-KNN, a fast and performant variant of 2D-Shapley [32]. We incorporate a distance regularization term in the utility function $T$ for enhanced performance. For 2D-KNN, we set the number of nearest neighbors as 10 and the number of permutations as 1000. |
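The experiment setup above can be sketched in code: each of the B trees is fit on a bootstrap sample restricted to a random subset of features, and every out-of-bag point i credits each cell (i, j), for j in the tree's feature subset, with the correctness indicator $\mathbb{1}(y_i = \hat{f}(x_i, S_b))$. This is a minimal illustrative sketch of the idea, not the authors' implementation; the function name `two_d_oob_sketch`, the small default B, and all defaults are assumptions for demonstration (the paper uses B = 1000 and scikit-learn's subset random forest).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def two_d_oob_sketch(X, y, B=100, feature_ratio=0.5, seed=0):
    """Sketch of out-of-bag joint (point, feature) valuation.

    Returns an (n, d) matrix of per-cell scores: the average OOB
    correctness of trees whose feature subset contained feature j,
    evaluated on point i when i was out of bag.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = max(1, int(d * feature_ratio))  # half the features by default
    score_sum = np.zeros((n, d))
    count = np.zeros((n, d))
    for _ in range(B):
        rows = rng.integers(0, n, size=n)             # bootstrap sample
        oob = np.setdiff1d(np.arange(n), rows)        # out-of-bag points
        feats = rng.choice(d, size=k, replace=False)  # random feature subset S_b
        tree = DecisionTreeClassifier(random_state=0)
        tree.fit(X[np.ix_(rows, feats)], y[rows])
        if oob.size == 0:
            continue
        # T(y_i, f(x_i, S_b)) = 1(y_i == prediction) for each OOB point i
        correct = (tree.predict(X[np.ix_(oob, feats)]) == y[oob]).astype(float)
        # credit every cell (i, j) with i out of bag and j in S_b
        score_sum[np.ix_(oob, feats)] += correct[:, None]
        count[np.ix_(oob, feats)] += 1
    # average over the trees that scored each cell; 0 for unscored cells
    return np.where(count > 0, score_sum / np.maximum(count, 1), 0.0)
```

Cells with low average scores correspond to (point, feature) pairs that tend to degrade OOB accuracy, which is how the paper's cell-level outlier detection use case ranks candidate outliers.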