Data Acquisition via Experimental Design for Data Markets
Authors: Charles Lu, Baihe Huang, Sai Praneeth Karimireddy, Praneeth Vepakomma, Michael Jordan, Ramesh Raskar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed method for data acquisition (DAVED) against common data valuation methods on both synthetic data and four real-world medical datasets: 1. Fitzpatrick17K [24], a skin lesion dataset, where the task is to predict Fitzpatrick skin tone on a 6-point scale from dermatology images. 2. RSNA Pediatric Bone Age dataset [25], where the task is to assess bone age (in months) from X-ray images of an infant's hand. 3. Medical Information Mart for Intensive Care (MIMIC-III) [31], where the task is to predict the length of hospital stay from 48 attributes such as demographics, insurance, and medical conditions. 4. DrugLib reviews [34], text reviews of drugs where the task is to predict ratings (1-10). For validation-based methods, we use a validation set of 100 datapoints. We report mean test errors over 100 buyers. |
| Researcher Affiliation | Academia | Charles Lu (MIT); Baihe Huang (UC Berkeley); Sai Praneeth Karimireddy (USC, UC Berkeley); Praneeth Vepakomma (MBZUAI, MIT); Michael I. Jordan (UC Berkeley); Ramesh Raskar (MIT) |
| Pseudocode | Yes | Algorithm 1 DAVED: Iterative Optimization Procedure |
| Open Source Code | Yes | Our code is available at this repo: https://github.com/clu5/data-acquisition-via-experimental-design. For reproducibility, our full implementation is available at: https://github.com/clu5/data-acquisition-via-experimental-design. |
| Open Datasets | Yes | The RSNA Pediatric Bone Age Challenge (2017) dataset [25] may be downloaded here: https://www.rsna.org/rsnai/ai-image-challenge/rsna-pediatric-bone-age-challenge-2017. The Fitzpatrick17K dataset [24] can be downloaded from here: https://github.com/mattgroh/fitzpatrick17k. The MIMIC dataset [31] can be accessed here: https://physionet.org/content/mimiciii/1.4/. The DrugLib dataset [34] can be downloaded here: https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com. |
| Dataset Splits | Yes | For validation-based methods, we use a validation set of 100 datapoints. Validation split: 100 points for baseline data valuation methods |
| Hardware Specification | Yes | We conduct all experiments on an Intel Xeon E5-2620 CPU with 40 cores and an Nvidia GTX 1080 Ti GPU. |
| Software Dependencies | Yes | For implementation of baseline data valuation methods, we use the OpenDataVal package [30] version 1.2.1. |
| Experiment Setup | Yes | In our experiments, we use the following hyperparameter settings for DAVED: 500 iterations for the multi-step variant and 1 iteration for the single-step variant; line search for step size α ∈ (0, 0.9); regularization λ = 0 (unless otherwise specified); no early stopping. For each test point, we train a linear regression model on the selected seller points and report test mean squared error (MSE) on the buyer's data, averaging test error over 100 buyers. |
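
The evaluation protocol quoted in the Experiment Setup row (fit a linear regression on the selected seller points, measure test MSE on the buyer's data, average over buyers) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `select_fn` interface, and the `random_selection` placeholder are all hypothetical, and the placeholder is a uniform random baseline rather than DAVED itself.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Ordinary least squares (minimum-norm solution if underdetermined)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def buyer_test_mse(X_sell, y_sell, selected, X_buy, y_buy):
    """Train on the selected seller points; return MSE on one buyer's data."""
    coef = fit_linear_regression(X_sell[selected], y_sell[selected])
    residual = X_buy @ coef - y_buy
    return float(np.mean(residual ** 2))


def mean_error_over_buyers(X_sell, y_sell, buyers, select_fn, budget):
    """Average test MSE over all buyers.

    `select_fn` is a hypothetical selection rule: it takes the seller
    features, one buyer's features, and a budget, and returns seller indices.
    """
    errors = []
    for X_buy, y_buy in buyers:
        selected = select_fn(X_sell, X_buy, budget)
        errors.append(buyer_test_mse(X_sell, y_sell, selected, X_buy, y_buy))
    return float(np.mean(errors))


def random_selection(X_sell, X_buy, budget, seed=0):
    """Placeholder baseline (NOT DAVED): choose seller points uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(X_sell), size=budget, replace=False)
```

A selection method under test would be passed in place of `random_selection`, for example `mean_error_over_buyers(X_sell, y_sell, buyers, random_selection, budget=20)`, where `buyers` is a list of `(X_buy, y_buy)` pairs such as the 100 buyers mentioned in the table.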