DeRDaVa: Deletion-Robust Data Valuation for Machine Learning

Authors: Xiao Tian, Rachael Hwee Ling Sim, Jue Fan, Bryan Kian Hsiang Low

AAAI 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We also empirically demonstrate the practicality of our solutions. |
| Researcher Affiliation | Academia | Xiao Tian1,2, Rachael Hwee Ling Sim1, Jue Fan1,2, Bryan Kian Hsiang Low1; 1 Department of Computer Science, National University of Singapore; 2 Department of Mathematics, National University of Singapore; {xiao.tian, rachael.sim, jue.fan}@u.nus.edu, lowkh@comp.nus.edu.sg |
| Pseudocode | Yes | The justification and pseudocode for 012-MCMC algorithm are included in App. D.2. |
| Open Source Code | No | The paper does not include any statement or link providing access to the open-source code for the methodology described. |
| Open Datasets | Yes | Our experiments use the following [model-dataset] combinations: [NB-CC] Naive Bayes trained on Credit Card (Yeh and Lien 2009), [NB-Db] Naive Bayes trained on Diabetes (Carrion, Dustin 2022), [NB-Wd] Naive Bayes trained on Wind (Vanschoren, Joaquin 2014), [SVM-Db] Support Vector Machine trained on Diabetes, and [LR-Pm] Logistic Regression trained on Phoneme (Grin, Leo 2022). |
| Dataset Splits | No | While the paper mentions "validation accuracy" in a general definition, it does not specify the explicit training/validation/test splits used for its own experiments, such as percentages or sample counts for a validation set. |
| Hardware Specification | Yes | The experiments are performed on a 64-bit Linux server with 256GB RAM and two Intel Xeon E5-2690 CPUs. |
| Software Dependencies | Yes | We implemented our solutions using Python 3.9.7 with scikit-learn 1.0.2. |
| Experiment Setup | Yes | For all experiments, we used Adam optimizer with learning rate 0.001 and batch size 64. The model training terminates when the validation loss does not improve for 10 epochs or after a maximum of 100 epochs. |
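For readers checking reproducibility, the quoted setup (Adam optimizer, learning rate 0.001, batch size 64, early stopping with patience 10, at most 100 epochs) happens to map directly onto parameters of scikit-learn's `MLPClassifier`, the library version the paper cites. The sketch below is only an illustration of that mapping, not the authors' code: the paper's actual models are Naive Bayes, SVM, and logistic regression, the dataset here is synthetic, and scikit-learn's early stopping monitors validation *score* rather than validation loss.

```python
# Hedged sketch only: reproduces the quoted training hyperparameters with
# scikit-learn's MLPClassifier on a synthetic dataset. This is NOT the
# authors' experiment code; model and data choices are assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = MLPClassifier(
    solver="adam",             # Adam optimizer
    learning_rate_init=0.001,  # learning rate 0.001
    batch_size=64,             # batch size 64
    early_stopping=True,       # holds out an internal validation split
    n_iter_no_change=10,       # stop after 10 epochs without improvement
    max_iter=100,              # at most 100 epochs
    random_state=0,
)
clf.fit(X_train, y_train)
print(f"epochs run: {clf.n_iter_}, test accuracy: {clf.score(X_test, y_test):.3f}")
```

Note the remaining gap flagged in the table: since the paper does not report its train/validation/test split sizes, the split above (`train_test_split` defaults) is arbitrary.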