Distributionally Robust Data Valuation
Authors: Xiaoqiang Lin, Xinyi Xu, Zhaoxuan Wu, See-Kiong Ng, Bryan Kian Hsiang Low
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction). |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore 2Institute of Data Science, National University of Singapore, Singapore. |
| Pseudocode | No | The paper describes methods and derivations but does not include explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our code is available at https://github.com/xqlin98/Distributionally-Robust-Data-Valuation. |
| Open Datasets | Yes | Datasets. (a) HOUSING (Kaggle, 2017), California housing price prediction. (b) UBER (Kaggle, 2018), carpool ride price prediction. (c) DIABETES (Strack et al., 2014), diabetes patients readmission prediction. (d) MNIST (Le Cun et al., 1990). (e) CIFAR-10 (Krizhevsky, 2009). |
| Dataset Splits | Yes | For the task of data selection, we select 45% (results for 20% and 80% are provided in Appendix B) of data points with the highest data values, train models using these selected data points, and evaluate the DRGE of the resulting model. Separately, we perform a k-means clustering on the features of data points, using k = 50 to split the large sampling dataset into 50 validation datasets (see the split sketch after this table). |
| Hardware Specification | Yes | All the experiments have been run on a server with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 256GB RAM, and 4 NVIDIA GeForce RTX 3080 GPUs. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, PyTorch version, or other library versions). |
| Experiment Setup | Yes | For kernel regression, we use a radial basis function (RBF) kernel with a length scale of 2. For NN, we use a 3-layer multi-layer perceptron (MLP) for regression and a 2-layer convolutional neural network (CNN) followed by a fully connected layer for classification. The epoch number is 10 with a learning rate of 0.05 and batch size of 128. Full hyperparameters are listed in Table 2 (kernel regression model) and Table 3 (NN) of the paper (see the configuration sketch after this table). |
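
Below is a minimal sketch of the data-selection and validation-split protocol quoted in the Dataset Splits row, assuming scikit-learn and NumPy. The function names and the `X_val`/`y_val` arrays are illustrative and not taken from the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_top_fraction(data_values, fraction=0.45):
    """Return indices of the data points with the highest data values.

    The paper selects the top 45% (20% and 80% are reported in Appendix B).
    """
    n_keep = int(fraction * len(data_values))
    return np.argsort(data_values)[::-1][:n_keep]

def kmeans_validation_splits(X_val, y_val, k=50, seed=0):
    """Cluster features with k-means (k = 50 in the paper) and split the
    large sampling dataset into k validation datasets, one per cluster."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X_val)
    return [(X_val[labels == c], y_val[labels == c]) for c in range(k)]
```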
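The Experiment Setup row fixes only the model depths, the RBF length scale, and the training hyperparameters; a minimal configuration sketch under those constraints follows, assuming PyTorch for the networks and scikit-learn for the kernel. Hidden widths, channel counts, and the 28x28 input size are illustrative assumptions; Tables 2 and 3 of the paper give the actual values.

```python
import torch.nn as nn
from sklearn.gaussian_process.kernels import RBF

# RBF kernel with a length scale of 2, as stated for kernel regression.
kernel = RBF(length_scale=2.0)

def make_mlp(in_dim, hidden=128):
    """3-layer MLP for regression (hidden width is an assumption)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def make_cnn(n_classes=10):
    """2-layer CNN followed by a fully connected layer for classification."""
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, n_classes),  # assumes 28x28 inputs (e.g., MNIST)
    )

# Training hyperparameters reported in the paper.
EPOCHS, LEARNING_RATE, BATCH_SIZE = 10, 0.05, 128
```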