Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

A Privacy-Friendly Approach to Data Valuation

Authors: Jiachen (Tianhao) Wang, Yuqing Zhu, Yu-Xiang Wang, Ruoxi Jia, Prateek Mittal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we systematically evaluate the practical effectiveness of our proposed TKNN-Shapley method. Our evaluation aims to demonstrate the following points: (1) TKNN-Shapley offers improved runtime efficiency compared with KNN-Shapley. (2) The differentially private version of TKNN-Shapley (DP-TKNN-Shapley) achieves significantly better privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. (3) Non-private TKNN-Shapley maintains a comparable performance to the original KNN-Shapley. These observations highlight TKNN-Shapley's potential for data valuation in real-life applications.
Researcher Affiliation | Academia | Jiachen T. Wang (Princeton University); Yuqing Zhu (UC Santa Barbara); Yu-Xiang Wang (UC Santa Barbara); Ruoxi Jia (Virginia Tech); Prateek Mittal (Princeton University)
Pseudocode | Yes | Algorithm 1: Membership Inference Attack via KNN-Shapley Scores.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for their methodology is open-source or publicly available.
Open Datasets | Yes | "Datasets: We conduct our experiments on a diverse set of 13 datasets, where 11 of them have been used in previous data valuation studies [39, 60]. Additionally, we experiment on 2 NLP datasets (AG News [65] and DBPedia [2])... A comprehensive list of datasets and sources is summarized in Table 7." Table 7 lists sources like "[41]" for MNIST, "[38]" for CIFAR10, "[65]" for AGnews, "[2]" for DBPedia, and "https://www.openml.org/d/1218" for Click, etc., which are concrete access information for publicly available datasets.
Dataset Splits | Yes | "A common choice for utility function is the validation accuracy of a model trained on the input training set. Formally, for a training set S, a utility function v(S) := acc(A(S)), where A is a learning algorithm that takes a dataset S as input and returns a model; acc(·) is a metric function that evaluates the performance of a given model, e.g., the classification accuracy on a hold-out validation set." and "The validation data size we use is 10% of the training data size."
Hardware Specification | Yes | Experiments were conducted with an AMD 64-Core CPU Processor.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers for replication (e.g., Python, PyTorch, scikit-learn versions).
Experiment Setup | Yes | For both TKNN-/KNN-Shapley, we use the popular cosine distance as the distance measure [53], which is always bounded in [-1, +1]. Throughout all experiments, we use τ = 0.5 and K = 5 for TKNN-/KNN-Shapley, respectively, as we found the two choices consistently work well across all datasets.
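
The utility function quoted under Dataset Splits, v(S) := acc(A(S)), can be illustrated with a minimal sketch in which A is a plain K-nearest-neighbor classifier and acc is accuracy on a hold-out validation set. This is not the paper's code (none is released). The function names, the toy data, the use of 1 minus cosine similarity as the distance, and the value 0.0 for an empty training subset are all assumptions made for illustration. The threshold variant (τ) is not covered here.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    # Assumed convention: 1 - cosine similarity. The paper's exact
    # definition may differ. Zero-norm vectors are not handled.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_predict(train, query, k=5):
    # train is a list of (features, label) pairs.
    # Majority vote among the k training points closest to the query.
    nearest = sorted(train, key=lambda item: cosine_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def utility(train_subset, validation, k=5):
    # v(S) := acc(A(S)) with A = KNN and acc = validation accuracy.
    # Returning 0.0 for an empty subset is an assumed convention.
    if not train_subset:
        return 0.0
    correct = sum(knn_predict(train_subset, x, k) == y for x, y in validation)
    return correct / len(validation)


# Hypothetical toy data, for illustration only.
train = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"), ([0.0, 1.0], "b"), ([0.1, 0.9], "b")]
validation = [([1.0, 0.2], "a"), ([0.2, 1.0], "b")]

full_utility = utility(train, validation, k=1)         # all four training points
partial_utility = utility(train[:2], validation, k=1)  # only the two "a" points
```

Comparing full_utility with partial_utility shows how the utility changes when training points are removed, which is the quantity a Shapley-style data valuation aggregates over many subsets.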