Efficient Backpropagation with Variance Controlled Adaptive Sampling

Authors: Ziteng Wang, Jianfei Chen, Jun Zhu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process.
Researcher Affiliation | Academia | Ziteng Wang, Jianfei Chen, Jun Zhu. Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University. wangzite23@mails.tsinghua.edu.cn; {jianfeic, dcszj}@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: Variance controlled adaptive sampling (VCAS) for backpropagation. (An illustrative sampling sketch is given at the end of this section.)
Open Source Code | Yes | The implementation is available at https://github.com/thu-ml/VCAS.
Open Datasets | Yes | We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy... [lists datasets such as C4, MNLI-m, QQP, QNLI, SST-2, CIFAR10, CIFAR100, ImageNet-1k]
Dataset Splits | Yes | On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy... [references common benchmark datasets like MNLI, QQP, QNLI, SST-2, CIFAR10, CIFAR100, ImageNet-1k, which have predefined splits]
Hardware Specification | Yes | We record the wall-clock time of BERT-large finetuning on MNLI and ViT-large finetuning on ImageNet-1k with NVIDIA 3090Ti.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with specific versions like PyTorch 1.x or CUDA 11.x) are provided.
Experiment Setup | Yes | For all these experiments we use the same conservative setting of τ_act = τ_w = 0.025, α = 0.01, β = 0.95, M = 2. We preset all these values heuristically without any tuning or prior knowledge. The only hyperparameter we modified among different tasks is the variance calculation frequency F, which can be defined easily according to the total training steps. For BERT finetuning, we use the AdamW optimizer with lr = 2e-5 and wd = 0.01. The learning rate scheduler is a linear one with warmup ratio = 0.1. We set the number of epochs N = 3 and a batch size of 32. For ViT finetuning, we use the Adam optimizer with lr = 2e-5. A linear lr scheduler with no warmup is employed. We run N = 5 epochs with a batch size of 32. (A configuration sketch follows the table.)
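The BERT finetuning recipe quoted in the Experiment Setup row maps onto a standard PyTorch optimizer/scheduler configuration. The sketch below is a minimal illustration under the stated hyperparameters (AdamW, lr = 2e-5, wd = 0.01, linear schedule with 10% warmup, 3 epochs, batch size 32); the placeholder model, `steps_per_epoch`, and the hand-written warmup function are illustrative assumptions, not taken from the paper or its code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted in the Experiment Setup row (BERT finetuning case).
lr, weight_decay = 2e-5, 0.01
num_epochs, batch_size, warmup_ratio = 3, 32, 0.1   # batch_size used when building the DataLoader
steps_per_epoch = 1000           # placeholder; depends on the dataset size

model = torch.nn.Linear(768, 3)  # stand-in for the actual BERT model
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

total_steps = num_epochs * steps_per_epoch
warmup_steps = int(warmup_ratio * total_steps)

def linear_with_warmup(step: int) -> float:
    # Linear warmup to the peak lr, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=linear_with_warmup)
# Inside the training loop: optimizer.step(); scheduler.step()
```

The ViT finetuning case differs only in the quoted details (Adam instead of AdamW, no warmup, 5 epochs).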
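The Pseudocode row refers to Algorithm 1 (VCAS), which is given only in the paper. As a hedged illustration of the general idea of sampling during backpropagation, the sketch below subsamples per-example gradient rows and reweights the survivors by their inverse keep probability, so the estimate stays correct in expectation. It does not implement VCAS itself or its adaptive variance control (the τ_act, τ_w, α, β, M, F hyperparameters quoted above); the function name and all details are illustrative assumptions.

```python
import torch

def sample_rows_unbiased(grad: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of per-example gradient rows and reweight the kept
    rows by the inverse of their keep probability, so the expected value of the
    output equals `grad` while most rows become exactly zero and could be
    skipped in later matrix multiplies. Illustrative only; NOT the paper's
    Algorithm 1."""
    batch = grad.shape[0]
    norms = grad.flatten(1).norm(dim=1)          # per-example gradient norm
    total = norms.sum()
    if total == 0:
        return grad
    # Keep probabilities proportional to the norms, targeting an expected
    # fraction `keep_ratio` of the batch, capped at 1.
    p = (norms / total * keep_ratio * batch).clamp(max=1.0)
    keep = torch.bernoulli(p)
    scale = torch.where(p > 0, keep / p, torch.zeros_like(p))
    return grad * scale.view(batch, *([1] * (grad.dim() - 1)))

# Example: a [batch, features] activation gradient with ~50% of rows kept.
g = torch.randn(32, 768)
g_sampled = sample_rows_unbiased(g, keep_ratio=0.5)
```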