LAVA: Data Valuation without Pre-Specified Learning Algorithms

Authors: Hoang Anh Just, Feiyang Kang, Tianhao Wang, Yi Zeng, Myeongseob Ko, Ming Jin, Ruoxi Jia

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4 EXPERIMENTS: In this section, we demonstrate the practical efficacy and efficiency of LAVA on various classification datasets. We compare with nine baselines..."
Researcher Affiliation | Academia | "Hoang Anh Just¹, Feiyang Kang*¹, Jiachen T. Wang², Yi Zeng¹, Myeongseob Ko¹, Ming Jin¹, and Ruoxi Jia¹ (¹Virginia Tech, ²Princeton University)"
Pseudocode | No | The paper describes its methods using mathematical formulations and prose, but it does not contain clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Repository publicly available on GitHub: https://github.com/ruoxi-jia-group/LAVA.
Open Datasets | Yes | "We evaluate on five different use cases of data valuation: detecting backdoor attack, poisoning attack, noisy features, mislabeled data, and irrelevant data. ... We attack the German Traffic Sign dataset (GTSRB)... Here, we show the usage of LAVA on the MNIST dataset... We consider a CIFAR-10 dataset... We demonstrate computation efficiency on a larger scale dataset (100,000 samples) with higher dimensions, ImageNet-100."
Dataset Splits | Yes | "For all methods to be compared, a validation set of 10,000 samples is assumed."
Hardware Specification | Yes | "A server with an NVIDIA Tesla P100-PCIE-16GB graphic card is used as the hardware platform in this work."
Software Dependencies | No | "For our implementation, we use PyTorch for the main framework (Paszke et al., 2019), assisted by three main libraries, which are otdd (optimal transport calculation setup with datasets) (Alvarez-Melis & Fusi, 2020), geomloss (actual optimal transport calculation) (Feydy et al., 2019), and numpy (tool for array routines) (Harris et al., 2020)."
Experiment Setup | No | "Details about datasets, models, hyperparameter settings, and ablation studies of the hyperparameters and validation sizes are provided in Appendix B." This sentence promises details, but Appendix B.9 explicitly states only the epsilon value (0.1); the ablation studies vary other parameters without listing the default values used in the main experiments, and common training hyperparameters (learning rate, batch size, optimizer) for the deep neural network used as a feature extractor are not provided.
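The software-dependencies row above names otdd and geomloss, the libraries LAVA relies on for optimal transport (OT) computation between datasets. As a hedged illustration of the core quantity those libraries compute, the following is a minimal entropic-OT (Sinkhorn) sketch in plain NumPy. The function name and toy data are illustrative, not from the LAVA codebase, and the regularization ε is set to 1.0 to keep these naive (non-log-domain) iterations numerically stable on this toy problem, rather than the ε = 0.1 the paper reports in Appendix B.9.

```python
# Illustrative sketch only: entropic-regularized OT cost between two point
# clouds with uniform weights, via naive Sinkhorn iterations. The real LAVA
# pipeline uses otdd/geomloss, which implement stabilized, GPU-backed versions.
import numpy as np

def sinkhorn_cost(x, y, eps=1.0, n_iters=200):
    """Entropic OT cost <P, C> between uniform measures on x and y."""
    n, m = len(x), len(y)
    # Pairwise squared-Euclidean cost matrix, shape (n, m).
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)                      # Gibbs kernel
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):                  # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]           # transport plan
    return float((P * C).sum())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(50, 2))
y = rng.normal(0.0, 1.0, size=(60, 2)) + 2.0  # shifted cloud

print(sinkhorn_cost(x, y))  # larger: the clouds are far apart
print(sinkhorn_cost(x, x))  # small: identical clouds
```

In LAVA the analogous OT distance is computed between labeled datasets (with a label-aware ground cost via otdd), and data values are read off from the gradient of that distance, but the Sinkhorn fixed point above is the computational core in both cases.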