Optimizing Data Collection for Machine Learning
Authors: Rafid Mahmood, James Lucas, Jose M. Alvarez, Sanja Fidler, Marc Law
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs. (...) We perform experiments over classification, segmentation, and detection tasks to show, on average, approximately a 2× reduction in the chances of failing to meet performance targets, versus estimation baselines. |
| Researcher Affiliation | Collaboration | 1NVIDIA 2University of Toronto 3Vector Institute |
| Pseudocode | Yes | Full details of the learning and optimization steps, including the complete Algorithm, are in Appendix B. (A hedged sketch of such an iterative collection loop appears below the table.) |
| Open Source Code | No | The code is proprietary. |
| Open Datasets | Yes | We explore classification on CIFAR-10 [36], CIFAR-100 [36], and ImageNet [37]... We explore semantic segmentation using Deeplabv3 [39] on BDD100K [40]... as well as Bird's-Eye-View (BEV) segmentation on nuScenes [41]... We explore 2-D object detection on PASCAL VOC [43, 44]... |
| Dataset Splits | No | The paper mentions evaluating on a “validation data set” and initializing with “q0 = 10% of the full data set”, but it does not provide specific training/validation/test dataset split percentages, absolute sample counts for each split, or detailed splitting methodology needed for reproduction across all datasets. |
| Hardware Specification | Yes | All experiments were run on a single machine with 8 NVIDIA A100 GPUs and 40 Intel Xeon CPU cores (2.20GHz). |
| Software Dependencies | No | The paper mentions using “Python 3 and PyTorch” in Appendix E and “SciPy [49]” for the Levenberg-Marquardt algorithm, but does not provide specific version numbers for these software components or other libraries needed for replication. (A fitting sketch using SciPy's Levenberg-Marquardt mode appears below the table.) |
| Experiment Setup | Yes | We train all models for 100 epochs with a batch size of 256 for classification, 32 for segmentation, and 32 for detection. We use the AdamW optimizer with a learning rate of 0.001. (A minimal PyTorch setup sketch appears below the table.) |
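
The paper's full learning-and-optimization procedure lives in its Appendix B, so the following is only a minimal sketch of the kind of iterative collect-train-estimate loop the framework describes. The callables `train_and_evaluate` and `estimate_requirement`, and the stopping heuristics, are hypothetical stand-ins, not the authors' algorithm.

```python
# Hypothetical sketch of an iterative data-collection loop: start from
# q0 = 10% of the full set (as the paper's initialization suggests),
# retrain, and grow the training set until the performance target is met.

def collect_until_target(dataset, target_score, train_and_evaluate,
                         estimate_requirement, q0=0.10, max_rounds=5):
    n = int(q0 * len(dataset))        # initial subset: q0 = 10% of the data
    history = []                      # observed (dataset size, score) pairs
    for _ in range(max_rounds):
        score = train_and_evaluate(dataset[:n])   # retrain on n samples
        history.append((n, score))
        if score >= target_score:     # performance target met: stop collecting
            break
        # Ask the estimator (e.g. the scaling-law fit sketched next) how
        # large the training set should be, capped at the full dataset.
        n = min(len(dataset), estimate_requirement(history, target_score))
    return n, history
```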
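
The scaling-law baseline the paper compares against can be illustrated with SciPy, which the paper cites for the Levenberg-Marquardt algorithm. The saturating power law `v(n) = v_inf - a * n**(-b)` below is an assumed functional form (the paper's exact regression function may differ); `scipy.optimize.curve_fit` with `method="lm"` selects the Levenberg-Marquardt solver.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, v_inf, a, b):
    # Saturating power law: score approaches v_inf as the dataset grows.
    return v_inf - a * np.power(n, -b)

def estimate_requirement(history, target_score):
    """Fit a scaling law to (size, score) pairs and invert it."""
    sizes = np.array([n for n, _ in history], dtype=float)
    scores = np.array([s for _, s in history], dtype=float)
    # method="lm" uses SciPy's Levenberg-Marquardt least-squares solver.
    params, _ = curve_fit(power_law, sizes, scores,
                          p0=(1.0, 1.0, 0.5), method="lm", maxfev=10000)
    v_inf, a, b = params
    if target_score >= v_inf:         # target unreachable under this fit
        return np.inf
    # Invert v(n) = target_score:  n = (a / (v_inf - target)) ** (1 / b)
    return int(np.ceil((a / (v_inf - target_score)) ** (1.0 / b)))
```

Note that Levenberg-Marquardt needs at least as many (size, score) observations as fit parameters (three here), so the loop above must gather a few rounds of history before this estimator is usable.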
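
Finally, the reported hyperparameters (100 epochs, AdamW at learning rate 0.001, batch size 256 for classification) translate into a standard PyTorch setup such as the sketch below. The ResNet-18 backbone, CIFAR-10 normalization constants, and augmentation-free pipeline are assumptions for illustration, not details taken from the paper.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Classification setup with the reported hyperparameters: batch size 256,
# AdamW with lr = 0.001, 100 training epochs.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # CIFAR-10 channel means
                         (0.2470, 0.2435, 0.2616)),  # CIFAR-10 channel stds
])
train_set = datasets.CIFAR10("data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)   # assumed backbone
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):                             # 100 epochs, as reported
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```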