2D-Shapley: A Framework for Fragmented Data Valuation

Authors: Zhihong Liu, Hoang Anh Just, Xiangyu Chang, Xi Chen, Ruoxi Jia

ICML 2023

Reproducibility Variable Result LLM Response
Research Type Experimental This paper presents the first focused study on data valuation without assuming shared feature space or sample space. Toward that end, we make the following contributions. We present an approach that enables evaluation of the marginal contribution of a block... We abstract the block valuation problem into a two-dimensional (2D) cooperative game... We propose axioms that a proper valuation scheme should satisfy... We demonstrate that 2D-Shapley enables new applications... Section 4. Experiments. This section covers the two general application scenarios of 2D-Shapley. (1) Cell valuation, where each cell in the training data matrix is considered a data source and receives a score indicating its contribution to a learning task performed on the matrix... (2) Sub-matrix valuation, where a sub-matrix containing multiple cells is considered a data source and receives a joint score.
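The two-dimensional cooperative game quoted above assigns each cell (i, j) a value based on its marginal contribution under random permutations of both the sample axis and the feature axis. A minimal Monte Carlo sketch follows, assuming the 2D inclusion-exclusion form U(S∪{i}, T∪{j}) − U(S, T∪{j}) − U(S∪{i}, T) + U(S, T) for the marginal contribution of a cell; the function name `cell_value_mc` and the toy additive utility are illustrative assumptions, not the authors' implementation (their full algorithm is in the paper's Appendix E).

```python
import random
import numpy as np

def cell_value_mc(utility, n_rows, n_cols, i, j, n_perms=200, seed=0):
    """Monte Carlo estimate of a 2D-Shapley-style value for cell (i, j).

    For each pair of random row/column permutations, S and T are the
    rows/columns preceding i and j; the cell's marginal contribution is
    the 2D inclusion-exclusion term
    U(S+i, T+j) - U(S, T+j) - U(S+i, T) + U(S, T).
    """
    rng = random.Random(seed)
    rows, cols = list(range(n_rows)), list(range(n_cols))
    total = 0.0
    for _ in range(n_perms):
        rp, cp = rows[:], cols[:]
        rng.shuffle(rp)
        rng.shuffle(cp)
        S = frozenset(rp[:rp.index(i)])
        T = frozenset(cp[:cp.index(j)])
        total += (utility(S | {i}, T | {j}) - utility(S, T | {j})
                  - utility(S | {i}, T) + utility(S, T))
    return total / n_perms

# Toy utility: sum of a fixed "importance" matrix over the selected sub-matrix.
# (A real utility would be, e.g., validation accuracy of a model trained on it.)
M = np.array([[1.0, 0.0],
              [0.0, 3.0]])
u = lambda S, T: float(M[np.ix_(sorted(S), sorted(T))].sum()) if S and T else 0.0

print(cell_value_mc(u, 2, 2, 1, 1, n_perms=500))  # → 3.0 (equals M[1, 1] for this additive utility)
```

For an additive utility the 2D marginal contribution collapses to M[i, j] under every permutation, which makes the estimator's correctness easy to check by hand.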
Researcher Affiliation Academia 1Center for Intelligent Decision-Making and Machine Learning, Department of Information Systems and Intelligent Business, School of Management, Xi'an Jiaotong University, Xi'an, 710049, China. 2Bradley Department of Electrical and Computer Engineering, Virginia Tech, Virginia, USA. 3Department of Technology, Operations, and Statistics, Stern School of Business, New York University, New York, 10012, USA.
Pseudocode Yes The full details of the algorithm design are provided in Appendix E, and the pseudo-code is shown in Algorithm 1... The pseudo-code for the overall KNN-based approximation is provided in Algorithm 2.
Open Source Code Yes Code repository publicly available: https://github.com/ruoxi-jia-group/2dshapley
Open Datasets Yes For our experiments, we use the following datasets from the UCI Machine Learning Repository (Dua & Graff, 2017): Dataset Training Data Test Data Features Census Income... Default of Credit Card Clients... Heart Failure... Breast Cancer Wisconsin (Original)... Wine Dataset...
Dataset Splits No The paper lists training and test data sizes in Table 2 but does not specify a separate validation split or the methodology for creating one. While it mentions a "hold-out validation set" as a general concept for evaluating model performance in Section 2, it does not state that one was used in their specific experiments.
Hardware Specification Yes In this work, we used an 8-Core Intel Xeon Processor E5-2620 v4 @ 2.20 GHz CPU server as a hardware platform.
Software Dependencies No The paper states that a "decision tree classifier" was implemented for certain methods. However, it does not provide specific version numbers for any software libraries, frameworks (like scikit-learn, TensorFlow, PyTorch), or programming languages used in the experiments.
Experiment Setup Yes Empirically, we verified that for each of the methods, the cell values converge within 500 permutations, and that is the number we decided to use to run these methods... For bigger datasets, Census Income and Credit Default, we remove ten cells at a time, and for a smaller dataset, Breast Cancer, we remove one cell at a time... for a random cell with a sample index i and a feature index j, we generate an outlier value based on its feature j. We first recreate the distribution of feature j and then sample a value from a low-probability-density region, below 5% in our experiment.
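The outlier-generation step quoted above (recreate the feature's distribution, then draw a value from a region whose density falls below 5%) can be sketched as follows. This is a hedged reconstruction: the paper does not name its density estimator, so a plain histogram estimate is assumed here, and `low_density_outlier` is an illustrative name, not the authors' code.

```python
import numpy as np

def low_density_outlier(feature_values, pct=0.05, n_bins=50, seed=0):
    """Draw a value from a low-probability-density region of a feature.

    Assumed procedure: estimate the feature's density with a histogram,
    take the `pct` quantile of the nonzero bin densities as a cutoff,
    then sample uniformly inside one of the bins at or below that cutoff.
    """
    rng = np.random.default_rng(seed)
    hist, edges = np.histogram(feature_values, bins=n_bins, density=True)
    # Cutoff: the pct-quantile of the estimated (nonzero) bin densities.
    cutoff = np.quantile(hist[hist > 0], pct)
    # Candidate bins: every bin whose density is at or below the cutoff.
    low_bins = np.flatnonzero(hist <= cutoff)
    b = rng.choice(low_bins)
    return float(rng.uniform(edges[b], edges[b + 1]))

# Usage: for a roughly Gaussian feature, the outlier lands in a sparse tail bin.
vals = np.random.default_rng(1).normal(0.0, 1.0, 500)
outlier = low_density_outlier(vals)
```

The histogram is only one reasonable choice; a kernel density estimate would serve the same role, with the same "below the 5% density cutoff" selection rule.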