A Privacy-Friendly Approach to Data Valuation

Authors: Jiachen (Tianhao) Wang, Yuqing Zhu, Yu-Xiang Wang, Ruoxi Jia, Prateek Mittal

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we systematically evaluate the practical effectiveness of our proposed TKNN-Shapley method. Our evaluation aims to demonstrate the following points: (1) TKNN-Shapley offers improved runtime efficiency compared with KNN-Shapley. (2) The differentially private version of TKNN-Shapley (DP-TKNN-Shapley) achieves significantly better privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. (3) Non-private TKNN-Shapley maintains a comparable performance to the original KNN-Shapley. These observations highlight TKNN-Shapley's potential for data valuation in real-life applications.
Researcher Affiliation | Academia | Jiachen T. Wang (Princeton University, tianhaowang@princeton.edu); Yuqing Zhu (UC Santa Barbara, yuqingzhu@ucsb.edu); Yu-Xiang Wang (UC Santa Barbara, yuxiangw@cs.ucsb.edu); Ruoxi Jia (Virginia Tech, ruoxijia@vt.edu); Prateek Mittal (Princeton University, pmittal@princeton.edu)
Pseudocode | Yes | Algorithm 1: Membership Inference Attack via KNN-Shapley Scores. (A generic score-threshold sketch is given after this table.)
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for their methodology is open-source or publicly available.
Open Datasets | Yes | "Datasets: We conduct our experiments on a diverse set of 13 datasets, where 11 of them have been used in previous data valuation studies [39, 60]. Additionally, we experiment on 2 NLP datasets (AG News [65] and DBPedia [2])... A comprehensive list of datasets and sources is summarized in Table 7." Table 7 lists sources such as "[41]" for MNIST, "[38]" for CIFAR10, "[65]" for AG News, "[2]" for DBPedia, and "https://www.openml.org/d/1218" for Click, which provide concrete access information for publicly available datasets. (An example loading snippet is given after this table.)
Dataset Splits | Yes | "A common choice for utility function is the validation accuracy of a model trained on the input training set. Formally, for a training set S, a utility function v(S) := acc(A(S)), where A is a learning algorithm that takes a dataset S as input and returns a model; acc(·) is a metric function that evaluates the performance of a given model, e.g., the classification accuracy on a hold-out validation set." and "The validation data size we use is 10% of the training data size." (A runnable sketch of this utility function is given after this table.)
Hardware Specification | Yes | Experiments were conducted with an AMD 64-Core CPU Processor.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for replication (e.g., Python, PyTorch, scikit-learn versions).
Experiment Setup | Yes | For both TKNN/KNN-Shapley, we use the popular cosine distance as the distance measure [53], which is always bounded in [-1, +1]. Throughout all experiments, we use τ = -0.5 and K = 5 for TKNN-/KNN-Shapley, respectively, as we found the two choices consistently work well across all datasets. (The distance and threshold rules are illustrated in a sketch after this table.)
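
The pseudocode row above only names the paper's Algorithm 1 (membership inference via KNN-Shapley scores). The minimal sketch below is not a reproduction of that algorithm; it illustrates the generic pattern of thresholding a data-value score to guess membership, with a hypothetical placeholder_score standing in for an actual KNN-Shapley computation and an attack threshold that would need separate calibration.

```python
# Generic score-threshold membership inference sketch (NOT the paper's
# Algorithm 1). `score_fn` is a hypothetical stand-in for a KNN-Shapley
# value computation; `attack_threshold` would be calibrated separately.
import numpy as np

def membership_guess(candidate, train_set, val_set, score_fn, attack_threshold):
    """Predict 'member' when the candidate's value score is high enough."""
    return score_fn(candidate, train_set, val_set) >= attack_threshold

# Toy usage with a placeholder score: closeness to the training set.
rng = np.random.default_rng(0)
train_set = rng.normal(size=(200, 8))
val_set = rng.normal(size=(20, 8))

def placeholder_score(x, train, val):
    return -np.min(np.linalg.norm(train - x, axis=1))  # higher = closer to training data

print(membership_guess(train_set[0], train_set, val_set, placeholder_score, -0.5))
```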
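
As a quick illustration of the dataset accessibility noted in the open-datasets row, the snippet below shows one possible way to fetch the OpenML "Click" dataset (https://www.openml.org/d/1218) with scikit-learn; the paper does not describe its data-loading code, so this is an assumption, not the authors' pipeline.

```python
# Hedged example: fetch the OpenML dataset referenced as
# https://www.openml.org/d/1218 ("Click") via scikit-learn. This is an
# illustrative loading path, not the authors' actual preprocessing.
from sklearn.datasets import fetch_openml

click = fetch_openml(data_id=1218, as_frame=True)
X, y = click.data, click.target
print(X.shape, y.value_counts().to_dict())
```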
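
The utility function quoted in the dataset-splits row is concrete enough to sketch. Below is a minimal, runnable illustration of v(S) := acc(A(S)) with a validation set sized at 10% of the training data; the synthetic data and the logistic-regression learner A are assumptions made only so the example runs.

```python
# Minimal sketch of v(S) := acc(A(S)): train a learner A on subset S and
# score it on a hold-out validation set sized at 10% of the training data.
# The synthetic data and logistic-regression learner are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1100, n_features=20, random_state=0)
# 1000 training points, 100 validation points (10% of the training size).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=100, random_state=0)

def utility(subset_indices):
    """v(S) = validation accuracy of a model trained on the subset S."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[subset_indices], y_train[subset_indices])
    return accuracy_score(y_val, model.predict(X_val))

print(utility(np.arange(200)))  # utility of the first 200 training points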
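
Finally, to make the experiment-setup quote concrete, the sketch below contrasts the threshold neighborhood rule behind TKNN-Shapley with the K-nearest rule behind KNN-Shapley, assuming "cosine distance" means negative cosine similarity (hence the [-1, +1] bound) and using the quoted τ = -0.5 and K = 5; the data are random and this is not the authors' implementation.

```python
# Sketch of the two neighborhood rules in the quoted setup, assuming
# "cosine distance" = negative cosine similarity (bounded in [-1, +1]).
# TKNN keeps every training point within distance tau of the query;
# KNN keeps the K closest points. tau = -0.5 and K = 5 follow the quote.
import numpy as np

def cosine_distance(query, points):
    """Negative cosine similarity between a query vector and each row of points."""
    q = query / np.linalg.norm(query)
    P = points / np.linalg.norm(points, axis=1, keepdims=True)
    return -(P @ q)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 32))
x_query = rng.normal(size=32)

d = cosine_distance(x_query, X_train)
tknn_neighbors = np.flatnonzero(d <= -0.5)  # threshold rule (TKNN-style)
knn_neighbors = np.argsort(d)[:5]           # K-nearest rule (KNN-style)
print(len(tknn_neighbors), knn_neighbors.tolist())
```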