Optimizing Data Collection for Machine Learning

Authors: Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc Law

NeurIPS 2022

Reproducibility assessment (each entry lists the variable, the assessed result, and the supporting LLM response):
Research Type: Experimental. "Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs. (...) We perform experiments over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chances of failing to meet performance targets, versus estimation baselines."
Researcher Affiliation: Collaboration. NVIDIA; University of Toronto; Vector Institute.
Pseudocode: Yes. "Full details of the learning and optimization steps, including the complete Algorithm, are in Appendix B."
Open Source Code: No. The code is proprietary.
Open Datasets: Yes. "We explore classification on CIFAR-10 [36], CIFAR-100 [36], and ImageNet [37]... We explore semantic segmentation using Deeplabv3 [39] on BDD100K [40]... as well as Bird's-Eye-View (BEV) segmentation on nuScenes [41]... We explore 2-D object detection on PASCAL VOC [43, 44]..."
Dataset Splits: No. The paper mentions evaluating on a "validation data set" and initializing with "q0 = 10% of the full data set", but it does not give the train/validation/test split percentages, absolute sample counts per split, or the splitting methodology needed to reproduce the splits across all datasets.
Hardware Specification: Yes. "All experiments were run on a single machine with 8 NVIDIA A100 GPUs and 40 Intel Xeon CPU cores (2.20GHz)."
Software Dependencies: No. The paper mentions using "Python 3 and PyTorch" in Appendix E and "SciPy [49]" for the Levenberg-Marquardt algorithm, but it provides no version numbers for these or any other libraries needed for replication.
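The paper reports using SciPy's Levenberg-Marquardt implementation, presumably to fit the neural scaling-law curves used by the extrapolation baseline. The exact fitting code is not published; the sketch below is an assumption about what such a fit could look like, using `scipy.optimize.curve_fit` with `method='lm'` on a standard power-law model err(n) = a·n^b + c and synthetic (dataset size, error) pairs.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Neural scaling laws are commonly modeled as err(n) = a * n^b + c,
    # with b < 0 so that error decreases as dataset size n grows.
    return a * np.power(n, b) + c

# Hypothetical (dataset size, validation error) pairs; here the errors are
# synthetic and follow a known law so the fit can be checked.
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0, 16000.0])
errors = 5.0 * sizes ** -0.3 + 0.05

# Levenberg-Marquardt fit (SciPy's method='lm'), starting from a rough guess.
params, _ = curve_fit(power_law, sizes, errors,
                      p0=[2.0, -0.4, 0.0], method='lm')
a, b, c = params

# Extrapolation baseline: estimate the dataset size needed for a target error.
target = 0.10
n_required = ((target - c) / a) ** (1.0 / b)
```

The closed-form inversion in the last step is what makes this baseline risky in practice: small errors in the fitted exponent `b` produce large errors in the extrapolated `n_required`, which is the failure mode the paper's framework is designed to mitigate.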
Experiment Setup: Yes. "We train all models for 100 epochs with a batch size of 256 for classification, 32 for segmentation, and 32 for detection. We use the AdamW optimizer with a learning rate of 0.001."
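The quoted setup names AdamW with a learning rate of 0.001. For readers unfamiliar with the optimizer, a minimal NumPy sketch of the decoupled weight-decay update it performs (per Loshchilov & Hutter) is given below; the weight-decay value and the scalar test problem are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # One AdamW update: Adam moment estimates plus *decoupled* weight decay,
    # i.e. the decay term is applied directly to w, outside the moments.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# Illustration: minimize f(w) = w^2 on a single scalar parameter.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 10001):
    g = 2.0 * w            # gradient of w^2
    w, m, v = adamw_step(w, g, m, v, t)
```

In a real replication one would of course use `torch.optim.AdamW` with `lr=0.001` rather than reimplementing the update.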