Distributionally Robust Data Valuation
Authors: Xiaoqiang Lin, Xinyi Xu, Zhaoxuan Wu, See-Kiong Ng, Bryan Kian Hsiang Low
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our approach outperforms existing data valuation approaches in data selection and data removal tasks on real-world datasets (e.g., housing price prediction, diabetes hospitalization prediction). |
| Researcher Affiliation | Academia | 1Department of Computer Science, National University of Singapore, Singapore 2Institute of Data Science, National University of Singapore, Singapore. |
| Pseudocode | No | The paper describes methods and derivations but does not include explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our code is available at https://github.com/xqlin98/Distributionally-Robust-Data-Valuation. |
| Open Datasets | Yes | Datasets. (a) HOUSING (Kaggle, 2017), California housing price prediction. (b) UBER (Kaggle, 2018), carpool ride price prediction. (c) DIABETES (Strack et al., 2014), diabetes patients readmission prediction. (d) MNIST (Le Cun et al., 1990). (e) CIFAR-10 (Krizhevsky, 2009). |
| Dataset Splits | Yes | For the task of data selection, we select 45% (results for 20% and 80% are provided in Appendix B) of data points with the highest data values, train models using these selected data points, and evaluate the DRGE of the resulting model. Separately, we perform a k-means clustering on the features of data points, using k = 50 to split the large sampling dataset into 50 validation datasets (see the split sketch after this table). |
| Hardware Specification | Yes | All the experiments have been run on a server with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 256GB RAM, and 4 NVIDIA GeForce RTX 3080 GPUs. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, PyTorch version, or other library versions). |
| Experiment Setup | Yes | For kernel regression, we use a radial basis function (RBF) kernel with a length scale of 2. For NN, we use a 3-layer multi-layer perceptron (MLP) for regression and a 2-layer convolutional neural network (CNN) followed by a fully connected layer for classification. The epoch number is 10 with a learning rate of 0.05 and batch size of 128. Full hyperparameters are listed in Table 2 (kernel regression model) and Table 3 (NN) of the paper (see the configuration sketch after this table). |
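
Below is a minimal sketch of the data-selection and validation-split protocol quoted in the Dataset Splits row, assuming scikit-learn and NumPy. The function names and the `X_val`/`y_val` arrays are illustrative and not taken from the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_top_fraction(data_values, fraction=0.45):
    """Return indices of the data points with the highest data values.

    The paper selects the top 45% (20% and 80% are reported in Appendix B).
    """
    n_keep = int(fraction * len(data_values))
    return np.argsort(data_values)[::-1][:n_keep]

def kmeans_validation_splits(X_val, y_val, k=50, seed=0):
    """Cluster features with k-means (k = 50 in the paper) and split the
    large sampling dataset into k validation datasets, one per cluster."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X_val)
    return [(X_val[labels == c], y_val[labels == c]) for c in range(k)]
```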
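The Experiment Setup row fixes only the model depths, the RBF length scale, and the training hyperparameters; a minimal configuration sketch under those constraints follows, assuming PyTorch for the networks and scikit-learn for the kernel. Hidden widths, channel counts, and the 28x28 input size are illustrative assumptions; Tables 2 and 3 of the paper give the actual values.

```python
import torch.nn as nn
from sklearn.gaussian_process.kernels import RBF

# RBF kernel with a length scale of 2, as stated for kernel regression.
kernel = RBF(length_scale=2.0)

def make_mlp(in_dim, hidden=128):
    """3-layer MLP for regression (hidden width is an assumption)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

def make_cnn(n_classes=10):
    """2-layer CNN followed by a fully connected layer for classification."""
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, n_classes),  # assumes 28x28 inputs (e.g., MNIST)
    )

# Training hyperparameters reported in the paper.
EPOCHS, LEARNING_RATE, BATCH_SIZE = 10, 0.05, 128
```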