A Distributional Framework For Data Valuation

Authors: Amirata Ghorbani, Michael Kim, James Zou

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply distributional Shapley to diverse data sets and demonstrate its utility in a data-market setting. We investigate the empirical effectiveness of the distributional Shapley framework through experiments in three settings on large real-world data sets. As Figure 1 demonstrates, when training a logistic regression model, removing the points with the highest distributional Shapley values causes a sharp decrease in accuracy on both tasks, even under the most aggressive weighted-sampling and interpolation optimizations.
Researcher Affiliation | Academia | 1 Department of Electrical Engineering, Stanford University, CA, USA; 2 Department of Computer Science, Stanford University, CA, USA; 3 Department of Biomedical Data Science, Stanford University, CA, USA.
Pseudocode | Yes | Algorithm 1 (D-SHAPLEY). Fix: potential U : 𝒵* → [0, 1]; distribution 𝒟; m ∈ ℕ. Given: data set Z ⊆ 𝒵 to valuate; # iterations T ∈ ℕ. Algorithm 2 (FAST-D-SHAPLEY). Fix: potential U : 𝒵* → [0, 1]; distribution 𝒟; m ∈ ℕ. Given: valuation set Z ⊆ 𝒵; database B ∼ 𝒟^M; # iterations T ∈ ℕ; subsampling rate p ∈ [0, 1]; importance weights {w_k}; regression algorithm R.
Open Source Code | Yes | Code is available on GitHub at https://github.com/amiratag/DistributionalShapley
Open Datasets | Yes | The first setting uses the UK Biobank data set, containing the genotypic and phenotypic data of individuals in the UK (Sudlow et al., 2015); ... The second data set is Adult Income, where the task is to predict whether income exceeds $50K/yr given 14 personal features (Dua & Graff, 2017). ... we estimate the values of 50K images from the CIFAR10 data set.
Dataset Splits | No | The paper mentions using a "hold-out set" for the performance metric and a "database" for sampling, but does not explicitly detail distinct training, validation, and test splits (with percentages or sample counts) for hyperparameter tuning or early stopping.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions training a "logistic regression model" and other models but does not specify the versions of the software libraries or frameworks used (e.g., PyTorch, TensorFlow, scikit-learn).
Experiment Setup | No | The paper discusses algorithmic parameters such as the subsampling rate p and importance weights {w_k} for FAST-D-SHAPLEY, but it does not provide the hyperparameters (e.g., learning rate, batch size, number of epochs) of the machine learning models trained in the experiments.
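The D-SHAPLEY pseudocode summarized above is a Monte Carlo estimator: for each point z, repeatedly draw a set size k uniformly from {1, ..., m}, draw a set S of k−1 points i.i.d. from 𝒟, and average the marginal contribution U(S ∪ {z}) − U(S). A minimal sketch under stated assumptions (the names `potential` and `d_shapley` and the toy mean-estimation potential are illustrative, not the paper's implementation):

```python
import numpy as np

def potential(S):
    """Toy potential U(S) in [0, 1]: how well the sample mean of S
    estimates the true mean (0.0) of the data distribution D.
    Illustrative stand-in for a model's hold-out performance."""
    if len(S) == 0:
        return 0.0
    err = abs(np.mean(S))          # |empirical mean - true mean|
    return max(0.0, 1.0 - err)     # clip into [0, 1]

def d_shapley(points, sample_d, m, T, rng):
    """Monte Carlo estimate of distributional Shapley values.

    Each iteration: draw k ~ Unif{1..m}, draw S ~ D^{k-1}, then
    credit each point z with the marginal U(S + {z}) - U(S)."""
    values = np.zeros(len(points))
    for _ in range(T):
        k = rng.integers(1, m + 1)      # cardinality k in {1, ..., m}
        S = sample_d(k - 1, rng)        # S drawn i.i.d. from D
        base = potential(S)
        for i, z in enumerate(points):
            values[i] += potential(np.append(S, z)) - base
    return values / T

rng = np.random.default_rng(0)
sample_d = lambda n, rng: rng.normal(0.0, 1.0, size=n)  # D = N(0, 1)

# A point near the true mean should be worth more than an outlier.
vals = d_shapley(np.array([0.0, 5.0]), sample_d, m=20, T=2000, rng=rng)
print(vals)
```

Because the value of a point depends only on the point and the distribution 𝒟, not on a fixed companion data set, estimates like `vals` can be reused across different valuation queries.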
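The removal experiment cited under Research Type (accuracy drops sharply when the highest-valued points are deleted first) can be imitated in a few lines: sort training points by estimated value, remove the top-valued ones in batches, and re-evaluate after each batch. The sketch below is an assumption-laden stand-in, using a 1-D nearest-centroid classifier instead of the paper's logistic regression, and proxy values instead of real D-Shapley estimates:

```python
import numpy as np

def accuracy(train_X, train_y, test_X, test_y):
    """Nearest-centroid classifier on 1-D features: a dependency-free
    stand-in for the paper's logistic regression."""
    for c in (0, 1):                      # guard: a class may be emptied
        if not np.any(train_y == c):
            return 0.0
    centroids = np.array([train_X[train_y == c].mean() for c in (0, 1)])
    pred = np.argmin(np.abs(test_X[:, None] - centroids[None, :]), axis=1)
    return float(np.mean(pred == test_y))

def removal_curve(X, y, values, test_X, test_y, batch=10):
    """Remove the highest-valued training points first, retraining and
    re-evaluating after each batch of removals."""
    order = np.argsort(values)[::-1]      # most valuable first
    accs = []
    for n_removed in range(0, len(X) - batch, batch):
        keep = order[n_removed:]
        accs.append(accuracy(X[keep], y[keep], test_X, test_y))
    return accs

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
y = np.array([0] * 50 + [1] * 50)
test_X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)])
test_y = y.copy()

# Proxy values (not D-Shapley): points near their class center count as
# valuable, so they are removed first.
values = -np.abs(X - np.where(y == 0, -1.0, 1.0))
curve = removal_curve(X, y, values, test_X, test_y, batch=10)
print(curve)
```

In the paper's setting, `values` would instead come from the D-SHAPLEY estimates, and the steepness of this curve is what Figure 1 reports.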