DAVINZ: Data Valuation using Deep Neural Networks at Initialization
Authors: Zhaoxuan Wu, Yao Shu, Bryan Kian Hsiang Low
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | we theoretically derive a domain-aware generalization bound to estimate the generalization performance of DNNs without model training. We then exploit this theoretically derived generalization bound to develop a novel training-free data valuation method named data valuation at initialization (DAVINZ), which consistently achieves remarkable effectiveness and efficiency in practice. Moreover, our training-free DAVINZ, surprisingly, can even theoretically and empirically enjoy the desirable properties that training-based data valuation methods usually attain, thus making it more trustworthy in practice. Finally, we perform extensive comparisons with both training-based and training-free data valuation baselines to justify the effectiveness and efficiency of our DAVINZ as well as the desirable properties it enjoys (Sec. 6). |
| Researcher Affiliation | Academia | 1Institute of Data Science, National University of Singapore, Republic of Singapore 2Integrative Sciences and Engineering Programme, NUSGS, Republic of Singapore 3Department of Computer Science, National University of Singapore, Republic of Singapore. |
| Pseudocode | Yes | Algorithm 1 Data Valuation at Initialization (DAVINZ) |
| Open Source Code | No | The paper mentions using a third-party implementation: "We adopt the PyTorch autograd hacks implementation in this paper." However, it does not provide a link or an explicit statement that the code developed for this paper's methodology is open-source or otherwise available. |
| Open Datasets | Yes | We use MNIST and CIFAR-10 for classification tasks and the Ising physical model dataset (Mills & Tamblyn, 2019) for regression tasks. All MNIST and CIFAR-10 images are pre-processed by re-scaling the pixel values to the [0, 1] range. We split the images into 10 datasets, each containing images of a single label class. The number of data samples in each dataset varies from 1000 to 1250. The validation dataset contains 10K images from all class labels. Details of the data split can be found in Table 6. The Ising physical model dataset aims to predict system energy based on an Ising array of atomic spin states. We split the input data into 10 datasets... Details of the data split can be found in Table 7. |
| Dataset Splits | Yes | We split the images into 10 datasets, each containing images of a single label class. The number of data samples in each dataset varies from 1000 to 1250. The validation dataset contains 10K images from all class labels. This baseline setup mimics the practical scenario where a particular agent only has access to a specific type of data... The validation objective is, however, learning a grand model that performs well on all label classes of this classification problem... In practice, we experience problems finding the true validation performance given a model architecture and a dataset. |
| Hardware Specification | Yes | All experiments have been run on a server with Intel(R) Xeon(R) @ 2.20GHz processors and 512GB RAM. One Tesla V100 GPU is used for the experiments. |
| Software Dependencies | No | The paper mentions the "PyTorch autograd hacks implementation" but does not provide specific version numbers for PyTorch, autograd hacks, or any other software dependencies. |
| Experiment Setup | Yes | In our experiments, we fix the batch size to 128 throughout and train all models with a learning rate of 0.01 until convergence for both MNIST and CIFAR-10 classification tasks. We train all models with a learning rate of 0.1 until convergence for the Ising model regression task. Convergence is assumed when the training loss over two consecutive epochs is below a very small threshold of 10^-8. The only exception is the MLP10 model on the Ising physical model dataset, for which we use a smaller threshold of 10^-10 to ensure convergence. For DAVINZ calculations, we additionally turn off the Batch Normalization layers to remove their effect on NTK evaluations. |
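The dataset-split row describes partitioning MNIST/CIFAR-10 into 10 single-class datasets of 1000 to 1250 samples each. A minimal sketch of that grouping logic is shown below; the function name `split_by_class` and the per-class subsampling details are illustrative assumptions, since the paper only states the class grouping and the size range (Tables 6 and 7 hold the exact counts).

```python
import random


def split_by_class(samples, labels, min_n=1000, max_n=1250, seed=0):
    """Group samples by label, then subsample each class to a size in
    [min_n, max_n], mimicking the paper's 10 single-class datasets.
    The size range follows the paper; the random subsampling rule
    here is an assumption for illustration."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    datasets = {}
    for y, xs in by_class.items():
        # Cap at the available class size, otherwise draw a size in range.
        n = min(len(xs), rng.randint(min_n, max_n))
        datasets[y] = rng.sample(xs, n)
    return datasets
```

In the paper's setup each agent would hold one of these single-class datasets, while the shared 10K validation set spans all classes.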
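The convergence criterion in the experiment-setup row ("training loss over two consecutive epochs is below a very small threshold of 10^-8") can be read as a check on the loss change between consecutive epochs. A minimal sketch of that stopping rule, under that reading, is below; the function name `has_converged` is illustrative, and the interpretation as a loss *difference* is an assumption.

```python
def has_converged(loss_history, threshold=1e-8):
    """Stopping rule sketched from the paper's description: assume
    convergence when the training loss change across two consecutive
    epochs falls below the threshold (1e-8 by default; the paper uses
    1e-10 for MLP10 on the Ising physical model dataset)."""
    if len(loss_history) < 2:
        return False
    return abs(loss_history[-1] - loss_history[-2]) < threshold
```

A training loop would append the epoch loss to `loss_history` each epoch and stop once this check passes.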