Efficient Backpropagation with Variance Controlled Adaptive Sampling

Authors: Ziteng Wang, Jianfei Chen, Jun Zhu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy with an up to 73.87% FLOPs reduction of BP and 49.58% FLOPs reduction of the whole training process.
Researcher Affiliation | Academia | Ziteng Wang, Jianfei Chen, Jun Zhu. Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University. wangzite23@mails.tsinghua.edu.cn; {jianfeic, dcszj}@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: Variance controlled adaptive sampling (VCAS) for backpropagation. (An illustrative sampling sketch is given at the end of this section.)
Open Source Code | Yes | The implementation is available at https://github.com/thu-ml/VCAS.
Open Datasets | Yes | We assessed VCAS on multiple fine-tuning and pre-training tasks in both vision and natural language domains. On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy... [lists datasets such as C4, MNLI-m, QQP, QNLI, SST-2, CIFAR10, CIFAR100, ImageNet-1k]
Dataset Splits | Yes | On all the tasks, VCAS can preserve the original training loss trajectory and validation accuracy... [references common benchmark datasets like MNLI, QQP, QNLI, SST-2, CIFAR10, CIFAR100, ImageNet-1k, which have predefined splits]
Hardware Specification | Yes | We record the wall-clock time of BERT-large finetuning on MNLI and ViT-large finetuning on ImageNet-1k with NVIDIA 3090Ti.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with specific versions like PyTorch 1.x or CUDA 11.x) are provided.
Experiment Setup | Yes | For all these experiments we use the same conservative setting of τ_act = τ_w = 0.025, α = 0.01, β = 0.95, M = 2. We preset all these values heuristically without any tuning or prior knowledge. The only hyperparameter we modified among different tasks is the variance calculation frequency F, which can be defined easily according to the total training steps. For BERT finetuning, we use the AdamW optimizer with lr = 2e-5 and wd = 0.01. The learning rate scheduler is a linear one with warmup ratio = 0.1. We set the number of epochs N = 3 and a batch size of 32. For ViT finetuning, we use the Adam optimizer with lr = 2e-5. A linear lr scheduler with no warmup is employed. We run N = 5 epochs with a batch size of 32. (A configuration sketch follows the table.)
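The BERT finetuning recipe quoted in the Experiment Setup row maps onto a standard PyTorch optimizer/scheduler configuration. The sketch below is a minimal illustration under the stated hyperparameters (AdamW, lr = 2e-5, wd = 0.01, linear schedule with 10% warmup, 3 epochs, batch size 32); the placeholder model, `steps_per_epoch`, and the hand-written warmup function are illustrative assumptions, not taken from the paper or its code.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted in the Experiment Setup row (BERT finetuning case).
lr, weight_decay = 2e-5, 0.01
num_epochs, batch_size, warmup_ratio = 3, 32, 0.1   # batch_size used when building the DataLoader
steps_per_epoch = 1000           # placeholder; depends on the dataset size

model = torch.nn.Linear(768, 3)  # stand-in for the actual BERT model
optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

total_steps = num_epochs * steps_per_epoch
warmup_steps = int(warmup_ratio * total_steps)

def linear_with_warmup(step: int) -> float:
    # Linear warmup to the peak lr, then linear decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=linear_with_warmup)
# Inside the training loop: optimizer.step(); scheduler.step()
```

The ViT finetuning case differs only in the quoted details (Adam instead of AdamW, no warmup, 5 epochs).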
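The Pseudocode row refers to Algorithm 1 (VCAS), which is given only in the paper. As a hedged illustration of the general idea of sampling during backpropagation, the sketch below subsamples per-example gradient rows and reweights the survivors by their inverse keep probability, so the estimate stays correct in expectation. It does not implement VCAS itself or its adaptive variance control (the τ_act, τ_w, α, β, M, F hyperparameters quoted above); the function name and all details are illustrative assumptions.

```python
import torch

def sample_rows_unbiased(grad: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep a random subset of per-example gradient rows and reweight the kept
    rows by the inverse of their keep probability, so the expected value of the
    output equals `grad` while most rows become exactly zero and could be
    skipped in later matrix multiplies. Illustrative only; NOT the paper's
    Algorithm 1."""
    batch = grad.shape[0]
    norms = grad.flatten(1).norm(dim=1)          # per-example gradient norm
    total = norms.sum()
    if total == 0:
        return grad
    # Keep probabilities proportional to the norms, targeting an expected
    # fraction `keep_ratio` of the batch, capped at 1.
    p = (norms / total * keep_ratio * batch).clamp(max=1.0)
    keep = torch.bernoulli(p)
    scale = torch.where(p > 0, keep / p, torch.zeros_like(p))
    return grad * scale.view(batch, *([1] * (grad.dim() - 1)))

# Example: a [batch, features] activation gradient with ~50% of rows kept.
g = torch.randn(32, 768)
g_sampled = sample_rows_unbiased(g, keep_ratio=0.5)
```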