Data Shapley: Equitable Valuation of Data for Machine Learning
Authors: Amirata Ghorbani, James Zou
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the estimation and applications of data Shapley across systematic experiments on real and synthetic data. Extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, Stanford University, Stanford, CA, USA 2Department of Biomedical Data Science, Stanford University, Stanford, CA, USA. |
| Pseudocode | Yes | Algorithm 1 Truncated Monte Carlo Shapley, Algorithm 2 Gradient Shapley |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | In this experiment, we use the UK Biobank data set (Sudlow et al., 2015); we use the spam classification data set (Metsis et al., 2006); flower image classification data set (adapted from https://goo.gl/Xgr1a1); Fashion MNIST data set (Xiao et al., 2017); Dog vs. Fish data set introduced in (Koh & Liang, 2017). For each class, 1200 images are extracted from ImageNet (Russakovsky et al., 2015); hospital readmission data set (Strack et al., 2014) |
| Dataset Splits | No | In all of the following experiments, we have a train set, a separate test set used for calculating V, and a held-out set used for reporting the final results of each figure. The paper does not explicitly state details about a separate 'validation' dataset split or provide specific percentages/counts for train/validation/test splits. |
| Hardware Specification | No | For all the experiments, calculating data Shapley values took less than 24 hours on four machines running in parallel (each with 4 cpus) except for one of the experiments where the model is a Conv-Net for which 4 GPUs were utilized in parallel for 120 hours. The paper does not provide specific model numbers for CPUs or GPUs. |
| Software Dependencies | No | The paper mentions machine learning models such as 'Logistic regression', 'Multinomial Naive Bayes model', 'Inception-V3 model', 'convolutional neural network', 'Random Forest regression model', and 'gradient boosting classifier'. However, it does not specify any software libraries or frameworks with their version numbers required for reproduction. |
| Experiment Setup | Yes | Our convergence criterion for TMC-Shapley and G-Shapley is $\frac{1}{n}\sum_{i=1}^{n} \frac{|\phi_i^t - \phi_i^{t-100}|}{|\phi_i^t|} < 0.05$. We randomly flip the label for 20% of training points. We corrupt 10% of train data by adding white noise, compute the average TMC-Shapley value of clean and noisy images, and repeat the same experiment with different levels of noise. We perform a hyper-parameter search for the learning algorithm to find the setting yielding the best performance for a model trained on only one pass of the data, which, in our experiments, results in learning rates larger than those used for multi-epoch model training. |
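The Pseudocode and Experiment Setup rows reference Algorithm 1 (Truncated Monte Carlo Shapley) and its convergence criterion. As a minimal sketch of the TMC-Shapley idea in NumPy: sample random permutations of the training set, accumulate each point's marginal contribution to a performance score V, truncate a scan once the running score is within a tolerance of the full-data score, and stop when the paper's 5% average-relative-change criterion is met. The nearest-centroid value function, tolerances, and helper names below are illustrative assumptions, not the paper's exact models or settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def value(X_tr, y_tr, X_te, y_te):
    # Performance score V(S): accuracy of a nearest-centroid classifier,
    # a cheap stand-in for the paper's learning algorithms.
    if len(y_tr) == 0 or len(np.unique(y_tr)) < 2:
        return 0.5  # degenerate subset: random-guess baseline
    cents = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    d = ((X_te[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return float((d.argmin(axis=1) == y_te).mean())

def tmc_shapley(X, y, X_te, y_te, max_iters=300, trunc_tol=0.01):
    n = len(X)
    phi = np.zeros(n)       # running mean of marginal contributions
    history = []
    v_full = value(X, y, X_te, y_te)
    for t in range(1, max_iters + 1):
        perm = rng.permutation(n)
        v_prev = value(X[:0], y[:0], X_te, y_te)  # empty-set score
        for j, i in enumerate(perm):
            if abs(v_full - v_prev) < trunc_tol:
                v_new = v_prev  # truncation: remaining marginals taken as ~0
            else:
                idx = perm[: j + 1]
                v_new = value(X[idx], y[idx], X_te, y_te)
            # incremental update of the mean marginal for point i
            phi[i] += (v_new - v_prev - phi[i]) / t
            v_prev = v_new
        history.append(phi.copy())
        # paper's criterion: mean relative change vs. 100 iterations ago < 5%
        if t > 100:
            old = history[t - 101]
            rel = np.abs(phi - old) / np.maximum(np.abs(phi), 1e-12)
            if rel.mean() < 0.05:
                break
    return phi
```

On well-separated toy data with one flipped label, the mislabeled point's estimated Shapley value comes out below the clean points', matching the table's observation that low-value data capture label corruptions.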