InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning

Authors: Ziheng Qin, Kai Wang, Zangwei Zheng, Jianyang Gu, Xiangyu Peng, Xu Zhao Pan, Daquan Zhou, Lei Shang, Baigui Sun, Xuansong Xie, Yang You

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We verify the effectiveness of our method on multiple datasets: CIFAR-10/100 (Krizhevsky et al., a;b), ImageNet-1K (Deng et al., 2009), ADE20K (Zhou et al., 2017) and FFHQ (Karras et al., 2019). InfoBatch consistently obtains lossless training results on classification, semantic segmentation, vision pretraining, and instruction fine-tuning tasks. |
| Researcher Affiliation | Collaboration | 1 National University of Singapore, 2 Alibaba Group; {zihengq, kai.wang, youy}@comp.nus.edu.sg |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is publicly available at NUS-HPC-AI-Lab/InfoBatch. |
| Open Datasets | Yes | We verify the effectiveness of our method on multiple datasets: CIFAR-10/100 (Krizhevsky et al., a;b), ImageNet-1K (Deng et al., 2009), ADE20K (Zhou et al., 2017) and FFHQ (Karras et al., 2019). |
| Dataset Splits | Yes | ImageNet-1K is the subset of the ImageNet-21K dataset with 1,000 categories. It contains 1,281,167 training images and 50,000 validation images. |
| Hardware Specification | Yes | Results are reported with ResNet-50 under 40% prune ratio for 90 epochs on an 8-A100 GPU server. We use V100 for this experiment... |
| Software Dependencies | No | The paper states using "PyTorch (Paszke et al., 2019)", "Timm (Wightman et al., 2021)", and "mmsegmentation (Contributors, 2020)" but does not specify their version numbers. |
| Experiment Setup | Yes | For InfoBatch, default values r = 0.5 and δ = 0.875 are used if not specified. For classification tasks... all models are trained with the OneCycle scheduler (cosine annealing)... using the default setting and SGD/LARS optimizer... with momentum 0.9, weight decay 5e-4. All images are augmented with commonly adopted transformations, i.e. normalization, random crop, and horizontal flip... and "LARS uses a max learning rate 2.3 for the OneCycle scheduler under the batch size of 128, and a maximum learning rate of 5.62 for a batch size of 256." (Illustrative sketches follow this table.) |
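The defaults r = 0.5 and δ = 0.875 quoted above are InfoBatch's pruning probability and annealing fraction. Since the paper provides no structured pseudocode, the snippet below is a minimal, hedged sketch of the soft-pruning idea as described in the paper's abstract-level summary: each epoch, samples whose recorded loss is below the current mean are dropped with probability r, the surviving low-loss samples have their loss rescaled by 1/(1 − r) so the expected gradient stays unbiased, and pruning is disabled for the final (1 − δ) fraction of epochs. The class name `SoftPruningScores` and every implementation detail here are illustrative assumptions; the authors' actual implementation lives in the NUS-HPC-AI-Lab/InfoBatch repository.

```python
import torch


class SoftPruningScores:
    """Minimal sketch of loss-based soft data pruning with rescaling.

    Hedged reading of the method: keep one loss score per sample; each epoch,
    drop below-mean samples with probability r, weight the surviving
    below-mean samples by 1 / (1 - r), and stop pruning for the last
    (1 - delta) fraction of the epochs.
    """

    def __init__(self, num_samples: int, r: float = 0.5, delta: float = 0.875,
                 total_epochs: int = 200):
        # Unseen samples start at +inf so they are never treated as "well learned".
        self.scores = torch.full((num_samples,), float("inf"))
        self.weights = torch.ones(num_samples)
        self.r, self.delta, self.total_epochs = r, delta, total_epochs

    def epoch_indices(self, epoch: int) -> torch.Tensor:
        """Return shuffled sample indices for this epoch and refresh loss weights."""
        n = self.scores.numel()
        self.weights.fill_(1.0)
        if epoch >= int(self.delta * self.total_epochs):
            # Annealing phase: train on the full dataset with unit weights.
            return torch.randperm(n)
        finite = self.scores[torch.isfinite(self.scores)]
        if finite.numel() == 0:
            return torch.randperm(n)  # nothing scored yet (first epoch)
        well_learned = self.scores < finite.mean()
        dropped = well_learned & (torch.rand(n) < self.r)
        kept = ~dropped
        # Rescale kept low-loss samples to keep the gradient estimate unbiased.
        self.weights[well_learned & kept] = 1.0 / (1.0 - self.r)
        idx = kept.nonzero(as_tuple=True)[0]
        return idx[torch.randperm(idx.numel())]

    def update(self, indices: torch.Tensor, per_sample_loss: torch.Tensor) -> None:
        """Store the latest per-sample losses (detached) as pruning scores."""
        self.scores[indices] = per_sample_loss.detach().cpu()


# Usage inside a training step (per-sample losses, then weighted mean):
# losses = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
# pruner.update(batch_indices, losses)
# loss = (losses * pruner.weights[batch_indices].to(losses.device)).mean()
# loss.backward()
```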
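For the classification setup quoted in the table (OneCycle scheduler with cosine annealing, SGD with momentum 0.9 and weight decay 5e-4, random crop, horizontal flip, and normalization), a minimal PyTorch sketch is below. The backbone, max learning rate, step counts, crop size, and normalization statistics are placeholders rather than values from the paper, and LARS is omitted because it is not part of core PyTorch.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import OneCycleLR
from torchvision import transforms
from torchvision.models import resnet18

# Augmentations named in the paper: random crop, horizontal flip, normalization.
# The crop size and normalization constants below are CIFAR-style placeholders.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

model = resnet18(num_classes=10)  # placeholder backbone
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# OneCycle schedule with cosine annealing; max_lr and steps_per_epoch are placeholders.
epochs, steps_per_epoch = 200, 391
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=epochs,
                       steps_per_epoch=steps_per_epoch, anneal_strategy="cos")

# Per batch: optimizer.step(); scheduler.step()
```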