Dataset Pruning: Reducing Training Data by Examining Generalization Influence
Authors: Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, Ping Li
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% training examples on the CIFAR-10 dataset, halves the convergence time with only 1.3% test accuracy decrease, which is superior to previous score-based sample selection methods. |
| Researcher Affiliation | Collaboration | (1) School of Electrical and Data Engineering, University of Technology Sydney; (2) Cognitive Computing Lab, Baidu Research |
| Pseudocode | Yes | Algorithm 1 Generalization guaranteed dataset pruning. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the proposed methodology or a link to a code repository. |
| Open Datasets | Yes | We evaluate dataset pruning methods on CIFAR10, CIFAR100 (Krizhevsky, 2009), and TinyImageNet (Le & Yang, 2015) datasets. |
| Dataset Splits | No | The paper mentions using a 'validation' set in Table 1 ("validation accuracies"), but it does not provide explicit details about the split percentages, sample counts, or the methodology for creating the training, validation, and test splits needed for reproduction. |
| Hardware Specification | Yes | Time cost (min): 113, 113, 113, 113, 3029; "... time of training 720 architectures on a Tesla V100 GPU" |
| Software Dependencies | No | The paper mentions optimizers (SGD), but does not provide specific version numbers for software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or GPU acceleration libraries (e.g., CUDA). |
| Experiment Setup | Yes | Specifically, in all experiments, we train the model for 200 epochs with a batch size of 128, a learning rate of 0.01 with cosine annealing learning rate decay strategy, SGD optimizer with the momentum of 0.9 and weight decay of 5e-4, data augmentation of random crop and random horizontal flip. |
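The Research Type and Pseudocode rows above refer to pruning 40% of the CIFAR-10 training set and retraining on the remainder. As a generic, hypothetical illustration of that workflow only (not the paper's Algorithm 1, whose generalization-influence criterion is not reproduced here), a minimal sketch assuming PyTorch and a precomputed per-example score array:

```python
# Hedged illustration of pruning a fixed fraction of training examples by score.
# `scores` is a placeholder for any per-example importance measure; the paper's
# actual generalization-influence criterion is not reproduced here.
import numpy as np
from torch.utils.data import Subset, DataLoader

def prune_dataset(dataset, scores, prune_fraction=0.4):
    """Keep the (1 - prune_fraction) highest-scoring examples as a Subset."""
    n_keep = int(len(dataset) * (1.0 - prune_fraction))
    keep_idx = np.argsort(scores)[::-1][:n_keep]  # highest scores first
    return Subset(dataset, keep_idx.tolist())

# Example usage with random placeholder scores (illustrative only):
# scores = np.random.rand(len(train_set))
# pruned_set = prune_dataset(train_set, scores, prune_fraction=0.4)
# pruned_loader = DataLoader(pruned_set, batch_size=128, shuffle=True)
```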
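The Experiment Setup row quotes the full set of training hyperparameters. A minimal sketch of that configuration, assuming PyTorch/torchvision and CIFAR-10; the ResNet-18 backbone and the crop padding value are illustrative assumptions, not stated in the row:

```python
# Sketch of the reported setup: 200 epochs, batch size 128, SGD (lr=0.01,
# momentum=0.9, weight decay=5e-4), cosine annealing, random crop + horizontal flip.
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.RandomCrop(32, padding=4),   # padding value is an assumption
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)  # assumed architecture
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```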