Data Distribution Valuation
Authors: Xinyi Xu, Shuaiqi Wang, Chuan Sheng Foo, Bryan Kian Hsiang Low, Giulia Fanti
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate that our method is sample-efficient and effective in identifying valuable data distributions against several existing baselines, on multiple real-world datasets (e.g., network intrusion detection, credit card fraud detection) and downstream applications (classification, regression). |
| Researcher Affiliation | Collaboration | Xinyi Xu, Department of Computer Science, National University of Singapore, xinyi.xu@u.nus.edu; Shuaiqi Wang, Department of Electrical and Computer Engineering, Carnegie Mellon University, shuaiqiw@andrew.cmu.edu; Chuan-Sheng Foo, Institute for Infocomm Research, Agency for Science, Technology and Research, foo_chuan_sheng@i2r.a-star.edu.sg; Bryan Kian Hsiang Low, Department of Computer Science, National University of Singapore, lowkh@comp.nus.edu.sg; Giulia Fanti, Department of Electrical and Computer Engineering, Carnegie Mellon University, gfanti@andrew.cmu.edu |
| Pseudocode | No | The paper describes methods and definitions in text and mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/XinyiYS/Data_Distribution_Valuation. |
| Open Datasets | Yes | CaliH (resp. KingH) is a housing prices dataset in California [35] (resp. in Kings county [28]). Census15 (resp. Census17) is a personal income prediction dataset from the 2015 (resp. 2017) US census [48]. Credit7 [49] and Credit31 [3] are two credit card fraud detection datasets. TON [47] and UGR16 [45] are two network intrusion detection datasets. |
| Dataset Splits | No | The paper discusses the use of a 'validation set Dval' for baselines that explicitly require it for empirical comparison, but does not specify training, validation, and test splits for its own experimental setup in terms of percentages or counts. |
| Hardware Specification | Yes | Our experiments are run on a server with Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz and 4 NVIDIA GeForce RTX 3080s (each with 10 GB of memory). |
| Software Dependencies | No | The paper mentions software like 'radial basis function kernel' and 'MMD-GAN', but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | ML model M. For model-specific baselines such as DAVINZ and CS, in Sec. 5.2, we adopt a 2-layer convolutional neural network (CNN) for MNIST, EMNIST, FaMNIST; ResNet-18 [29] for CIFAR10 and CIFAR100; logistic regression (LogReg) for Credit7 and Credit31, and TON and UGR16; linear regression (LR) for CaliH and KingH, and Census15 and Census17. Details are in App. D. Table 4 provides an overall summary of the experimental settings. |
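
The dataset-to-model pairing described in the Experiment Setup row can be written as a small lookup table. This is only an illustrative sketch: the names `BASELINE_MODELS` and `model_for` are hypothetical and are not taken from the authors' repository, and the dataset keys are written without internal spaces. The values are descriptive labels, not model objects.

```python
# Hypothetical lookup table summarising the Experiment Setup row.
# Identifiers are illustrative assumptions, not the authors' code.
BASELINE_MODELS = {
    # Image classification datasets
    "MNIST": "2-layer CNN",
    "EMNIST": "2-layer CNN",
    "FaMNIST": "2-layer CNN",
    "CIFAR10": "ResNet-18",
    "CIFAR100": "ResNet-18",
    # Credit card fraud and network intrusion detection datasets
    "Credit7": "logistic regression",
    "Credit31": "logistic regression",
    "TON": "logistic regression",
    "UGR16": "logistic regression",
    # Housing price and income regression datasets
    "CaliH": "linear regression",
    "KingH": "linear regression",
    "Census15": "linear regression",
    "Census17": "linear regression",
}


def model_for(dataset: str) -> str:
    """Return the baseline model label listed for a dataset name."""
    try:
        return BASELINE_MODELS[dataset]
    except KeyError:
        raise ValueError(f"no baseline model listed for dataset {dataset!r}") from None
```

For example, `model_for("UGR16")` returns the logistic regression label, while an unlisted dataset name raises a `ValueError` rather than silently falling back to a default.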