Multirate Training of Neural Networks

Authors: Tiffany J Vlaar, Benedict Leimkuhler

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization.
Researcher Affiliation | Academia | Tiffany Vlaar¹, Benedict Leimkuhler¹. ¹Department of Mathematics, University of Edinburgh, Edinburgh, United Kingdom. Correspondence to: Tiffany Vlaar <Tiffany.Vlaar@ed.ac.uk>.
Pseudocode | Yes | Algorithm 1 (Multirate SGD with linear drift):
  p^S := µ p^S + ∇_{θ^S} L(θ^S, θ^F)
  for i = 1, 2, ..., k do
    p^F := µ p^F + ∇_{θ^F} L(θ^S, θ^F)
    θ^F := θ^F − (h/k) p^F
    θ^S := θ^S − (h/k) p^S
  end for
Algorithm 2 (Multirate SGD, no linear drift):
  p^S := µ p^S + ∇_{θ^S} L(θ^S, θ^F)
  θ^S := θ^S − h p^S
  for i = 1, 2, ..., k do
    p^F := µ p^F + ∇_{θ^F} L(θ^S, θ^F)
    θ^F := θ^F − (h/k) p^F
  end for
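The update rule of Algorithm 1 maps onto a few lines of PyTorch. Below is a minimal sketch written as a plain training-step helper rather than the authors' released optimizer; the names (multirate_step, fast_params, slow_params, p_fast, p_slow) and the choice to draw a fresh minibatch for every gradient evaluation are assumptions, not taken from the paper's repository.

  import torch

  def multirate_step(model, loss_fn, data_iter, fast_params, slow_params,
                     p_fast, p_slow, h, k=5, mu=0.9):
      # Refresh the slow momentum once per outer step: p^S := mu p^S + grad_{theta^S} L.
      x, y = next(data_iter)
      loss = loss_fn(model(x), y)
      for g, buf in zip(torch.autograd.grad(loss, slow_params), p_slow):
          buf.mul_(mu).add_(g)

      # k fast updates; the slow parameters drift linearly along their frozen momentum.
      for _ in range(k):
          x, y = next(data_iter)
          loss = loss_fn(model(x), y)
          grads = torch.autograd.grad(loss, fast_params)
          with torch.no_grad():
              for g, buf, p in zip(grads, p_fast, fast_params):
                  buf.mul_(mu).add_(g)          # p^F := mu p^F + grad_{theta^F} L
                  p.add_(buf, alpha=-h / k)     # theta^F := theta^F - (h/k) p^F
              for buf, p in zip(p_slow, slow_params):
                  p.add_(buf, alpha=-h / k)     # theta^S := theta^S - (h/k) p^S
      return loss.item()

Algorithm 2 differs only in that the slow parameters take their full step of size h immediately after the slow momentum refresh, instead of drifting inside the inner loop.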
Open Source Code | Yes | PyTorch code supporting this work, including a ready-to-use torch.optimizer, has been made available at https://github.com/TiffanyVlaar/MultirateTrainingOfNNs.
Open Datasets | Yes | To demonstrate how multirate methods may be applicable in deep learning applications, consider a WideResNet-16 architecture trained on the patch-augmented CIFAR-10 dataset (Li et al., 2019) using SGD with momentum and weight decay and different learning rates (Figure 2). In this dataset a noisy patch of 7×7 pixels is added to the center of some CIFAR-10 images. Some images contain both the patch and CIFAR-10 data, while other images only contain the patch or are patch-free. ... We consider a ResNet-34 architecture (He et al., 2016), which has been pre-trained on ImageNet (Paszke et al., 2017), to classify CIFAR-10 data (Krizhevsky & Hinton, 2009). ... We also test our multirate approach on natural language data and consider a pre-trained DistilBERT (obtained from Hugging Face, transformers library). We fine-tune DistilBERT on SST-2 data (Socher et al., 2013)...
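For concreteness, the patch augmentation described in this excerpt can be sketched as a torchvision-style transform. The noise distribution and the fraction of images that receive the patch are not quoted above, so both are assumptions here (uniform noise, half of the images); the class name CenterNoisePatch is likewise illustrative.

  import torch

  class CenterNoisePatch:
      # Torchvision-style transform; expects a (C, 32, 32) tensor image in [0, 1].
      def __init__(self, patch_size=7, prob=0.5):
          self.patch_size = patch_size
          self.prob = prob

      def __call__(self, img):
          if torch.rand(()).item() < self.prob:
              img = img.clone()
              c, h, w = img.shape
              s = self.patch_size
              top, left = (h - s) // 2, (w - s) // 2
              img[:, top:top + s, left:left + s] = torch.rand(c, s, s)  # assumed uniform noise
          return img

Composed after transforms.ToTensor(), this yields images containing both the patch and the original CIFAR-10 content; the patch-only and patch-free variants mentioned above would need extra case handling.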
Dataset Splits | Yes | We compare our multirate approach (blue) to different fine-tuning approaches in Figure 4. Our multirate approach can be used to train the net in almost half the time, without reducing the test accuracy of the resulting net. We show in Figure 5 and Figure A10 in Appendix A that the same observations hold when training using linear learning rate decay or weight decay, respectively. In Figure 6 we repeat the experiment for a ResNet-50 architecture (pre-trained on ImageNet), which is fine-tuned on CIFAR-100 data, and observe the same behaviour.
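One plausible reading of the almost-half-time fine-tuning result is that the pre-trained backbone sits in the slow parameter group (its gradient computed once per outer step) while the freshly initialised classification head forms the fast group. The split below is a hedged illustration of that reading for the ResNet-34/CIFAR-10 setting, reusing the multirate_step sketch above; the exact grouping used in the paper is not quoted in this report, and the weights identifier assumes a recent torchvision.

  import torch
  import torchvision

  model = torchvision.models.resnet34(weights="IMAGENET1K_V1")   # pre-trained on ImageNet
  model.fc = torch.nn.Linear(model.fc.in_features, 10)           # new CIFAR-10 head

  fast_params = list(model.fc.parameters())                      # updated every inner step
  slow_params = [p for name, p in model.named_parameters()
                 if not name.startswith("fc.")]                  # gradient computed once per outer step

  # Zero-initialised momentum buffers for both groups.
  p_fast = [torch.zeros_like(p) for p in fast_params]
  p_slow = [torch.zeros_like(p) for p in slow_params]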
Hardware Specification | Yes | We performed our experiments in PyTorch on NVIDIA DGX-1 GPUs.
Software Dependencies | No | The paper mentions PyTorch but does not specify its version number or any other software dependencies with their versions.
Experiment Setup | Yes | We use as base algorithm SGD with momentum and performed a hyperparameter search to select the optimal learning rate for full network fine-tuning. ... We set h/k = 0.001, k = 5, and µ = 0.9 in Algorithm 1. ... We set h/k = 1e-4, k = 5, µ = 0.9 in Algorithm 1, batch size = 16, and average results over 10 runs.
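A hedged usage sketch wiring the quoted hyperparameters (h/k = 0.001, k = 5, µ = 0.9 for the vision runs) into the multirate_step helper above: model, fast_params, slow_params, p_fast, and p_slow are taken from the earlier sketches, while train_loader and num_outer_steps are placeholders for the user's own data pipeline and training budget.

  import torch

  def endless(loader):
      # Cycle through a DataLoader indefinitely so each gradient evaluation sees a fresh minibatch.
      while True:
          for batch in loader:
              yield batch

  k, mu = 5, 0.9
  h = 0.001 * k                        # h/k = 0.001 for vision; use 1e-4 * k for DistilBERT on SST-2

  loss_fn = torch.nn.CrossEntropyLoss()
  data_iter = endless(train_loader)    # train_loader: placeholder DataLoader

  for _ in range(num_outer_steps):     # num_outer_steps: placeholder training budget
      multirate_step(model, loss_fn, data_iter, fast_params, slow_params,
                     p_fast, p_slow, h=h, k=k, mu=mu)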