Efficient Backpropagation with Variance Controlled Adaptive Sampling
Authors: Ziteng Wang, Jianfei Chen, Jun Zhu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process. |
| Researcher Affiliation | Academia | Ziteng Wang, Jianfei Chen, Jun Zhu. Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University. wangzite23@mails.tsinghua.edu.cn; {jianfeic, dcszj}@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1 Variance controlled adaptive sampling (VCAS) for backpropagation |
| Open Source Code | Yes | The implementation is available at https://github.com/thu-ml/VCAS. |
| Open Datasets | Yes | We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy... [lists datasets such as C4, MNLI-m, QQP, QNLI, SST-2, CIFAR10, CIFAR100, ImageNet-1k] |
| Dataset Splits | Yes | On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy... [references common benchmark datasets like MNLI, QQP, QNLI, SST-2, CIFAR10, CIFAR100, ImageNet-1k, which have predefined splits] |
| Hardware Specification | Yes | We record the wall-clock time of BERT-large finetuning on MNLI and ViT-large finetuning on ImageNet-1k with an NVIDIA 3090Ti |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with specific versions like PyTorch 1.x or CUDA 11.x) are provided. |
| Experiment Setup | Yes | for all these experiments we use the same conservative setting of τact = τw = 0.025, α = 0.01, β = 0.95, M = 2. We preset all these values heuristically without any tuning or prior knowledge. The only hyperparameter we modified among different tasks is the variance calculation frequency F, which can be defined easily according to the total training steps. For BERT finetuning, we use the AdamW optimizer with lr = 2e-5 and wd = 0.01. The learning rate scheduler is a linear one with warmup ratio = 0.1. We set the number of epochs N = 3 and a batch size of 32. For ViT finetuning, we use the Adam optimizer with lr = 2e-5. A linear lr scheduler with no warmup is employed. We run N = 5 epochs with a batch size of 32 |
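
The sketch below illustrates the training configuration quoted in the Experiment Setup row (AdamW, lr = 2e-5, wd = 0.01, linear scheduler with warmup ratio 0.1, 3 epochs, batch size 32) together with the reported VCAS hyperparameters. It is a minimal, hedged reconstruction, not the authors' released code: the model stand-in, the `steps_per_epoch` value, the `F` placeholder, and the dictionary/function names are illustrative assumptions; only the numeric hyperparameters come from the quoted text.

```python
# Minimal sketch of the quoted fine-tuning setup (assumptions noted inline).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# VCAS variance-control hyperparameters as quoted above (key names are illustrative).
vcas_config = {
    "tau_act": 0.025,  # activation-gradient variance tolerance
    "tau_w": 0.025,    # weight-gradient variance tolerance
    "alpha": 0.01,
    "beta": 0.95,
    "M": 2,
    "F": 100,          # variance calculation frequency; task-dependent placeholder value
}

def linear_schedule_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup followed by linear decay, matching the quoted scheduler."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

# Hypothetical model and dataset sizes, for illustration only.
model = torch.nn.Linear(768, 3)           # stand-in for BERT-large plus a classifier head
num_epochs, batch_size = 3, 32            # quoted values for BERT finetuning
steps_per_epoch = 12_000                  # assumed; depends on the actual dataset size
total_steps = num_epochs * steps_per_epoch
warmup_steps = int(0.1 * total_steps)     # warmup ratio = 0.1

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
```

For the ViT runs, the same scheduler sketch applies with `warmup_steps = 0`, plain `torch.optim.Adam` at lr = 2e-5, and 5 epochs, per the quoted setup.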