Data Shapley: Equitable Valuation of Data for Machine Learning
Authors: Amirata Ghorbani, James Zou
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we demonstrate the estimation and applications of data Shapley across systematic experiments on real and synthetic data. Extensive experiments across biomedical, image and synthetic data demonstrate that data Shapley has several other benefits: 1) it is more powerful than the popular leave-one-out or leverage score in providing insight on what data is more valuable for a given learning task; 2) low Shapley value data effectively capture outliers and corruptions; 3) high Shapley value data inform what type of new data to acquire to improve the predictor. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, Stanford University, Stanford, CA, USA 2Department of Biomedical Data Science, Stanford University, Stanford, CA, USA. |
| Pseudocode | Yes | Algorithm 1 Truncated Monte Carlo Shapley, Algorithm 2 Gradient Shapley |
| Open Source Code | No | The paper does not provide an explicit statement about the release of source code for the described methodology, nor does it include a link to a code repository. |
| Open Datasets | Yes | In this experiment, we use the UK Biobank data set (Sudlow et al., 2015); we use the spam classification data set (Metsis et al., 2006); flower image classification data set (adapted from https://goo.gl/Xgr1a1); Fashion MNIST data set (Xiao et al., 2017); Dog vs. Fish data set introduced in (Koh & Liang, 2017). For each class, 1200 images are extracted from ImageNet (Russakovsky et al., 2015); hospital readmission data set (Strack et al., 2014) |
| Dataset Splits | No | In all of the following experiments, we have a train set, a separate test set used for calculating V, and a held-out set used for reporting the final results of each figure. The paper does not explicitly state details about a separate 'validation' dataset split or provide specific percentages/counts for train/validation/test splits. |
| Hardware Specification | No | For all the experiments, calculating data Shapley values took less than 24 hours on four machines running in parallel (each with 4 cpus) except for one of the experiments where the model is a Conv-Net for which 4 GPUs were utilized in parallel for 120 hours. The paper does not provide specific model numbers for CPUs or GPUs. |
| Software Dependencies | No | The paper mentions machine learning models such as 'Logistic regression', 'Multinomial Naive Bayes model', 'Inception-V3 model', 'convolutional neural network', 'Random Forest regression model', and 'gradient boosting classifier'. However, it does not specify any software libraries or frameworks with their version numbers required for reproduction. |
| Experiment Setup | Yes | Our convergence criterion for TMC-Shapley and G-Shapley is $\frac{1}{n}\sum_{i=1}^{n} \frac{|\phi_i^t - \phi_i^{t-100}|}{|\phi_i^t|} < 0.05$. We randomly flip the label for 20% of training points. We corrupt 10% of train data by adding white noise, compute the average TMC-Shapley value of clean and noisy images, and repeat the same experiment with different levels of noise. We perform a hyper-parameter search for the learning algorithm to find the setting yielding the best performance for a model trained on only one pass of the data, which, in our experiments, results in learning rates larger than those used for multi-epoch model training. |
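The Pseudocode and Experiment Setup rows reference Algorithm 1 (Truncated Monte Carlo Shapley) and its convergence criterion. As a minimal sketch of the TMC-Shapley idea in NumPy: sample random permutations of the training set, accumulate each point's marginal contribution to a performance score V, truncate a scan once the running score is within a tolerance of the full-data score, and stop when the paper's 5% average-relative-change criterion is met. The nearest-centroid value function, tolerances, and helper names below are illustrative assumptions, not the paper's exact models or settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def value(X_tr, y_tr, X_te, y_te):
    # Performance score V(S): accuracy of a nearest-centroid classifier,
    # a cheap stand-in for the paper's learning algorithms.
    if len(y_tr) == 0 or len(np.unique(y_tr)) < 2:
        return 0.5  # degenerate subset: random-guess baseline
    cents = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    d = ((X_te[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
    return float((d.argmin(axis=1) == y_te).mean())

def tmc_shapley(X, y, X_te, y_te, max_iters=300, trunc_tol=0.01):
    n = len(X)
    phi = np.zeros(n)       # running mean of marginal contributions
    history = []
    v_full = value(X, y, X_te, y_te)
    for t in range(1, max_iters + 1):
        perm = rng.permutation(n)
        v_prev = value(X[:0], y[:0], X_te, y_te)  # empty-set score
        for j, i in enumerate(perm):
            if abs(v_full - v_prev) < trunc_tol:
                v_new = v_prev  # truncation: remaining marginals taken as ~0
            else:
                idx = perm[: j + 1]
                v_new = value(X[idx], y[idx], X_te, y_te)
            # incremental update of the mean marginal for point i
            phi[i] += (v_new - v_prev - phi[i]) / t
            v_prev = v_new
        history.append(phi.copy())
        # paper's criterion: mean relative change vs. 100 iterations ago < 5%
        if t > 100:
            old = history[t - 101]
            rel = np.abs(phi - old) / np.maximum(np.abs(phi), 1e-12)
            if rel.mean() < 0.05:
                break
    return phi
```

On well-separated toy data with one flipped label, the mislabeled point's estimated Shapley value comes out below the clean points', matching the table's observation that low-value data capture label corruptions.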