Dataset Pruning: Reducing Training Data by Examining Generalization Influence

Authors: Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, Ping Li

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% training examples on the CIFAR-10 dataset, halves the convergence time with only 1.3% test accuracy decrease, which is superior to previous score-based sample selection methods. |
| Researcher Affiliation | Collaboration | (1) School of Electrical and Data Engineering, University of Technology Sydney; (2) Cognitive Computing Lab, Baidu Research |
| Pseudocode | Yes | Algorithm 1 Generalization guaranteed dataset pruning. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the proposed methodology or a link to a code repository. |
| Open Datasets | Yes | We evaluate dataset pruning methods on CIFAR10, CIFAR100 Krizhevsky (2009), and Tiny-ImageNet Le & Yang (2015) datasets. |
| Dataset Splits | No | The paper mentions using a 'validation' set in Table 1 ("validation accuracies"), but it does not provide explicit details about the split percentages, sample counts, or the methodology for creating the training, validation, and test splits needed for reproduction. |
| Hardware Specification | Yes | Time cost (min) 113 113 113 113 3029 ... time of training 720 architectures on a Tesla V100 GPU |
| Software Dependencies | No | The paper mentions optimizers (SGD), but does not provide specific version numbers for software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or GPU acceleration libraries (e.g., CUDA). |
| Experiment Setup | Yes | Specifically, in all experiments, we train the model for 200 epochs with a batch size of 128, a learning rate of 0.01 with cosine annealing learning rate decay strategy, SGD optimizer with the momentum of 0.9 and weight decay of 5e-4, data augmentation of random crop and random horizontal flip. (A minimal configuration sketch based on this description follows the table.) |
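
As a rough aid to reproduction, below is a minimal PyTorch sketch assembled only from the hyperparameters quoted in the Experiment Setup row (200 epochs, batch size 128, learning rate 0.01 with cosine annealing, SGD with momentum 0.9 and weight decay 5e-4, random crop and random horizontal flip on CIFAR-10). The ResNet-18 backbone, the crop padding, the omission of input normalization, and the random 60% subset standing in for the pruned training set are assumptions for illustration; the paper's actual selection rule (Algorithm 1, based on generalization influence) is not reproduced here.

```python
# Minimal sketch, assuming PyTorch/torchvision. Hyperparameters follow the
# paper's quoted setup; everything else (ResNet-18, random placeholder pruning)
# is an assumption and NOT the paper's Algorithm 1.
import torch
import torchvision
import torchvision.transforms as T

transform_train = T.Compose([
    T.RandomCrop(32, padding=4),   # "random crop" (padding value assumed)
    T.RandomHorizontalFlip(),      # "random horizontal flip"
    T.ToTensor(),                  # normalization stats not quoted, so omitted
])

full_train = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform_train)

# Placeholder pruning: keep a random 60% of CIFAR-10 (the paper prunes 40%).
# Replace `kept_indices` with the indices selected by the paper's method.
kept_indices = torch.randperm(len(full_train))[: int(0.6 * len(full_train))].tolist()
train_set = torch.utils.data.Subset(full_train, kept_indices)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2)

model = torchvision.models.resnet18(num_classes=10)   # backbone assumed, not quoted
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # cosine annealing stepped once per epoch
```

Wrapping the dataset in torch.utils.data.Subset keeps everything identical to full-data training except for which example indices reach the DataLoader, so different pruning criteria can be swapped in and compared under the same training recipe.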