Selection via Proxy: Efficient Data Selection for Deep Learning

Authors: Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this selection via proxy (SVP) approach on several data selection tasks across five datasets: CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity, and Amazon Review Full. For active learning, applying SVP can give an order of magnitude improvement in data selection runtime (i.e., the time it takes to repeatedly train and select points) without significantly increasing the final error (often within 0.1%).
Researcher Affiliation | Academia | Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia (Stanford University)
Pseudocode | Yes | Algorithm 1 GREEDY K-CENTERS (Wolf, 2011; Sener & Savarese, 2018). Input: data x_i, existing pool s_0, trained model A_{s_0}, and a budget b. (A sketch of this greedy selection rule appears after the table.)
Open Source Code | Yes | Code available at https://github.com/stanford-futuredata/selection-via-proxy
Open Datasets | Yes | Our experiments included three image classification datasets: CIFAR10, CIFAR100 (Krizhevsky & Hinton, 2009), and ImageNet (Russakovsky et al., 2015); and two text classification datasets: Amazon Review Polarity and Full (Zhang & LeCun, 2015; Zhang et al., 2015).
Dataset Splits | Yes | CIFAR10 and CIFAR100 each contain 50,000 images for training and 10,000 images for testing. ImageNet has 1.28 million training images and 50,000 validation images that belong to 1 of 1,000 classes.
Hardware Specification | Yes | Core-set selection experiments used a single Nvidia P100 GPU, while the active learning experiments used a Titan V GPU. For training, we used a custom machine with 4 Nvidia Titan V GPUs and followed Nvidia's optimized implementation.
Software Dependencies | No | The paper mentions 'PyTorch 1' and the 'apex' library, but does not provide specific version numbers for these software components. For example, 'PyTorch 1' is not a precise version like 'PyTorch 1.9'.
Experiment Setup | Yes | Starting with an initial random subset of 2% of the data, we selected 8% of the remaining unlabeled data in the first round and 10% in subsequent rounds until the labeled data reached the budget b, retraining the models from scratch between rounds as described in Section 2.1. We followed the same training procedure, initialization, and hyperparameters as He et al. (2016b), with the exception of weight decay, which was set to 0.0005 and decreased the model's validation error in all conditions. For active learning, we used the same batch size of 768 images for both ResNet18 and ResNet50 for simplicity, which was the maximum batch size that could fit into memory for ResNet50. (A sketch of this round schedule appears after the table.)
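
To make the selection step quoted in the Pseudocode row concrete, here is a minimal NumPy sketch of the greedy k-centers rule from Algorithm 1. The function name, the use of precomputed feature embeddings, and Euclidean distance are illustrative assumptions; this is not the authors' released code.

```python
import numpy as np

def greedy_k_centers(embeddings: np.ndarray, pool: list[int], budget: int) -> list[int]:
    """Greedy k-centers: repeatedly pick the unlabeled point farthest
    from its nearest already-selected point (a 2-approximation to k-centers)."""
    # Distance from every point to its nearest point in the existing pool s_0.
    min_dist = np.full(embeddings.shape[0], np.inf)
    for s in pool:
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[s], axis=1))

    chosen = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))  # farthest point from all current centers
        chosen.append(idx)
        # Update nearest-center distances with the newly chosen point.
        min_dist = np.minimum(min_dist, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return chosen
```

In SVP, the embeddings would come from the small proxy model rather than the full target model, which is what makes repeated selection rounds cheap.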
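
The round schedule in the Experiment Setup row can also be written out directly. The sketch below assumes the 8% and 10% increments are fractions of the full training set (the quoted wording says "remaining unlabeled data", so the exact round sizes here are an approximation):

```python
def active_learning_rounds(n_total: int, budget: int):
    """Yield cumulative labeled-pool sizes: a 2% random seed set, then roughly
    8% of the data in the first selection round, then 10% per round until the
    labeled pool reaches the budget b."""
    labeled = int(0.02 * n_total)  # initial random subset
    yield labeled
    labeled = min(labeled + int(0.08 * n_total), budget)  # first selection round
    yield labeled
    while labeled < budget:  # subsequent rounds
        labeled = min(labeled + int(0.10 * n_total), budget)
        yield labeled

# Example: CIFAR10 (50,000 training images) with a budget of half the data.
print(list(active_learning_rounds(50_000, 25_000)))
# -> [1000, 5000, 10000, 15000, 20000, 25000]
```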