Data Acquisition via Experimental Design for Data Markets

Authors: Charles Lu, Baihe Huang, Sai Praneeth Karimireddy, Praneeth Vepakomma, Michael Jordan, Ramesh Raskar

NeurIPS 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our proposed method for data acquisition (DAVED) against common data valuation methods on both synthetic data and four real-world medical datasets: 1. Fitzpatrick17K [24], a skin lesion dataset, where the task is to predict Fitzpatrick skin tone on a 6-point scale from dermatology images. 2. RSNA Pediatric Bone Age dataset [25], where the task is to assess bone age (in months) from X-ray images of an infant's hand. 3. Medical Information Mart for Intensive Care (MIMIC-III) [31], where the task is to predict the length of hospital stay from 48 attributes such as demographics, insurance, and medical conditions. 4. DrugLib reviews [34], text reviews of drugs where the task is to predict ratings (1-10). For validation-based methods, we use a validation set of 100 datapoints. We report mean test errors over 100 buyers.
Researcher Affiliation Academia Charles Lu (MIT); Baihe Huang (UC Berkeley); Sai Praneeth Karimireddy (USC, UC Berkeley); Praneeth Vepakomma (MBZUAI, MIT); Michael I. Jordan (UC Berkeley); Ramesh Raskar (MIT)
Pseudocode Yes Algorithm 1 DAVED: Iterative Optimization Procedure
Open Source Code Yes Our code is available at this repo: https://github.com/clu5/data-acquisition-via-experimental-design. For reproducibility, our full implementation is available at: https://github.com/clu5/data-acquisition-via-experimental-design.
Open Datasets Yes The RSNA Pediatric Bone Age Challenge (2017) dataset [25] may be downloaded here: https://www.rsna.org/rsnai/ai-image-challenge/rsna-pediatric-bone-age-challenge-2017. The Fitzpatrick17K dataset [24] can be downloaded here: https://github.com/mattgroh/fitzpatrick17k. The MIMIC dataset [31] can be accessed here: https://physionet.org/content/mimiciii/1.4/. The DrugLib dataset [34] can be downloaded here: https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com.
Dataset Splits Yes For validation-based methods, we use a validation set of 100 datapoints. Validation split: 100 points for baseline data valuation methods
Hardware Specification Yes We conduct all experiments on an Intel Xeon E5-2620 CPU with 40 cores and a Nvidia GTX 1080 Ti GPU.
Software Dependencies Yes For implementation of baseline data valuation methods, we use the OpenDataVal package [30] version 1.2.1.
Experiment Setup Yes In our experiments, we use the following setting of hyperparameters for DAVED: 500 iterations for the multi-step variant and 1 iteration for the single-step variant; line search for step size α ∈ (0, 0.9); regularization λ = 0 (unless otherwise specified); no early stopping. For each test point, we train a linear regression model on the selected seller points and report test mean squared error (MSE) on the buyer's data and average test error over 100 buyers.
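
The evaluation protocol quoted in the Experiment Setup row (fit a linear regression on the selected seller points, score MSE on the buyer's data, average over buyers) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `select_fn` interface, the absence of an intercept term, and the `random_selection` placeholder (which stands in for DAVED or any baseline) are all assumptions made for the example.

```python
import numpy as np


def buyer_test_mse(X_sell, y_sell, selected, X_buy, y_buy):
    """Fit least squares on the selected seller rows, return MSE on buyer data.

    Assumption: plain least squares without an intercept term.
    """
    # lstsq returns the minimum-norm solution even if the selected rows
    # are rank-deficient.
    w, *_ = np.linalg.lstsq(X_sell[selected], y_sell[selected], rcond=None)
    residual = X_buy @ w - y_buy
    return float(np.mean(residual ** 2))


def mean_error_over_buyers(X_sell, y_sell, buyers, select_fn, budget, seed=0):
    """Average the per-buyer test MSE over all buyers.

    `buyers` is a list of (X_buy, y_buy) pairs; `select_fn` is a hypothetical
    hook returning the indices of the seller points chosen for one buyer.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for X_buy, y_buy in buyers:
        selected = select_fn(X_sell, X_buy, budget, rng)
        errors.append(buyer_test_mse(X_sell, y_sell, selected, X_buy, y_buy))
    return float(np.mean(errors))


def random_selection(X_sell, X_buy, budget, rng):
    """Placeholder selection rule (NOT DAVED): pick `budget` rows at random."""
    return rng.choice(len(X_sell), size=budget, replace=False)
```

With 100 entries in `buyers`, `mean_error_over_buyers` would correspond to the "average test error over 100 buyers" figure; swapping `random_selection` for another rule with the same signature leaves the scoring loop unchanged.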