Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

Authors: Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aim to demonstrate the following assertions: (1) Data Shapley works well when the utility functions are defined on heterogeneous datasets, (2) Data Shapley's effectiveness is strongly correlated with the fitting quality of MTM functions to the utility functions, and (3) the utility functions' approximability by MTM functions is further correlated with their ρ-consistency index (deferred to Appendix C.4). In this section, to estimate Data Shapley, we use the most widely used permutation sampling estimator (Mitchell et al., 2022), where for each experiment the sampling budget is as high as 40,000 to reduce the instability in Shapley value estimation. Following Ghorbani & Zou (2019) and Kwon & Zou (2022), we use logistic regression as the learning algorithm in the main paper. Additional results with neural networks and detailed experiment settings are deferred to Appendix C. (A sketch of the permutation sampling estimator appears after the table.)
Researcher Affiliation | Academia | ¹Princeton University, ²East China Normal University, ³Stanford University, ⁴Columbia University, ⁵Virginia Tech.
Pseudocode | No | The paper describes algorithms and frameworks but does not include any clearly labeled "Pseudocode" or "Algorithm" blocks, nor structured steps resembling code.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide any links to a code repository.
Open Datasets | Yes | An overview of the dataset information we used in Section 6 can be found in Table 2. These are commonly used datasets in the existing literature on data valuation (Ghorbani & Zou, 2019; Kwon & Zou, 2022; Jia et al., 2019b; Wang & Jia, 2023a; Kwon & Zou, 2023; Wang et al., 2024). ... Table 2: A summary of datasets used in Section 6's experiments: Wind (https://www.openml.org/d/847), CPU (https://www.openml.org/d/761), Fraud (Dal Pozzolo et al., 2015), 2DPlanes (https://www.openml.org/d/727), Vehicle (Duarte & Hu, 2004), Apsfail (https://www.openml.org/d/41138), Pol (https://www.openml.org/d/722). (A loading sketch for the OpenML datasets appears after the table.)
Dataset Splits | Yes | Following Kwon & Zou (2022), for the datasets with multiple classes, we binarize the label by considering 1[y = 1]. Given the large amount of model retraining required in our experiments, for each dataset we take a size-200 subset as the training set and a size-2000 subset as the validation set. (A split sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using "logistic regression" and a "two-layer MLP model" with the "Adam optimizer", but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch, TensorFlow, scikit-learn versions).
Experiment Setup | Yes | "Here in the Appendix, we also show the results when using a two-layer MLP model as the learning algorithm, where there are 100 neurons in the hidden layer, ReLU activation function, batch size 128, (initial) learning rate 10^-2, and the Adam optimizer for training." and "We use batch size 32, (initial) learning rate 10^-3, and the Adam optimizer for training for 10 epochs." (An MLP training sketch appears after the table.)
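
Permutation-sampling sketch (re: Research Type). The quoted setup estimates Data Shapley with the permutation sampling estimator (Mitchell et al., 2022) and logistic regression as the learning algorithm. Below is a minimal sketch of such an estimator; the `utility` definition, the degenerate-subset fallback, and the `n_perms` and `max_iter` values are illustrative assumptions, not the paper's implementation (the paper reports a sampling budget of up to 40,000).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def utility(train_idx, X_tr, y_tr, X_val, y_val):
    """Validation accuracy of logistic regression trained on a subset.
    Falls back to majority-class accuracy when the subset cannot be fit
    (empty or single-class) -- an illustrative assumption."""
    idx = np.asarray(train_idx, dtype=int)
    if len(idx) == 0 or len(np.unique(y_tr[idx])) < 2:
        return max((y_val == c).mean() for c in np.unique(y_val))
    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    return model.score(X_val, y_val)

def permutation_shapley(X_tr, y_tr, X_val, y_val, n_perms=200, seed=0):
    """Monte Carlo (permutation sampling) Data Shapley estimate: for each
    sampled permutation, credit every point with its marginal utility gain
    U(predecessors + {i}) - U(predecessors), then average over permutations."""
    rng = np.random.default_rng(seed)
    n = len(y_tr)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = utility([], X_tr, y_tr, X_val, y_val)
        for k, i in enumerate(perm):
            cur = utility(perm[: k + 1], X_tr, y_tr, X_val, y_val)
            values[i] += cur - prev
            prev = cur
    return values / n_perms
```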
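
Dataset loading sketch (re: Open Datasets). The OpenML URLs in Table 2 carry dataset IDs that can be pulled with scikit-learn's `fetch_openml`; the helper below is our sketch, and the Fraud and Vehicle datasets (cited to papers rather than OpenML) are omitted.

```python
from sklearn.datasets import fetch_openml

# OpenML IDs taken from the Table 2 URLs quoted above.
OPENML_IDS = {"Wind": 847, "CPU": 761, "2DPlanes": 727, "Apsfail": 41138, "Pol": 722}

def load_openml(name):
    """Fetch one of the Table 2 OpenML datasets as (features, labels) arrays."""
    ds = fetch_openml(data_id=OPENML_IDS[name], as_frame=False)
    return ds.data, ds.target

X, y = load_openml("Wind")  # e.g., https://www.openml.org/d/847
```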
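
Split sketch (re: Dataset Splits). A minimal rendering of the quoted protocol: binarize labels as 1[y = 1] and draw disjoint size-200 train / size-2000 validation subsets. The uniform random sampling, fixed seed, and numeric-label assumption are ours; the quote specifies only the binarization and the subset sizes.

```python
import numpy as np

def make_splits(X, y, n_train=200, n_val=2000, seed=0):
    """Binarize multi-class labels as 1[y == 1] (labels assumed numeric),
    then draw disjoint train/validation subsets of the quoted sizes."""
    y_bin = (np.asarray(y) == 1).astype(int)
    idx = np.random.default_rng(seed).permutation(len(y_bin))
    tr, va = idx[:n_train], idx[n_train : n_train + n_val]
    return X[tr], y_bin[tr], X[va], y_bin[va]
```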
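
Training sketch (re: Experiment Setup). A minimal PyTorch rendering of the quoted MLP configuration; only the layer sizes, batch sizes, learning rates, optimizer, and epoch count come from the quote, while the cross-entropy loss and the loop structure are our assumptions.

```python
import torch
from torch import nn

def make_mlp(in_dim, hidden=100, n_classes=2):
    """Two-layer MLP from the quoted setup: 100 hidden units, ReLU activation."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, n_classes))

def train(model, loader, lr=1e-2, epochs=10):
    """Adam training loop. The quoted setups use lr 1e-2 with batch size 128,
    or lr 1e-3 with batch size 32 for 10 epochs; cross-entropy loss is assumed."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model
```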