Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

Authors: Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here theoretical support is provided for one of batch normalization's (BN) conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of T^{-1/2} in T iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate T^{-1/4} is also shown for stochastic gradient descent. We include some experiments in Appendix D, showing that it is indeed the auto-tuning behavior we analysed in this paper that empowers BN to have such convergence with an arbitrary learning rate for scale-invariant parameters. (See the convergence-rate sketch after this table.)
Researcher Affiliation | Academia | Sanjeev Arora, Princeton University and Institute for Advanced Study, arora@cs.princeton.edu; Zhiyuan Li, Princeton University, zhiyuanli@cs.princeton.edu; Kaifeng Lyu, Tsinghua University, lkf15@mails.tsinghua.edu.cn
Pseudocode | No | The paper provides mathematical equations for updates (e.g., Equation 4, Equation 9) but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not provide any statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We trained a modified version of VGGNet (Simonyan & Zisserman, 2014) on Tensorflow. In this network, every kernel is scale-invariant, and for every BN layer except the last one, the concatenation of all β and γ parameters in this BN is also scale-invariant. Only β and γ parameters in the last BN are scale-variant (See Section 2.1). We consider the training in the following two settings: [...] We train the network in either setting with different learning rates ranging from 10^{-2} to 10^{2} for 100 epochs.
Dataset Splits | No | The paper mentions 'training loss' and 'test accuracy' and uses the CIFAR-10 dataset. However, it does not specify any explicit train/validation/test dataset splits (e.g., percentages or sample counts), nor does it reference predefined splits with citations for reproducibility of data partitioning.
Hardware Specification | No | The paper states: 'We thank Amazon Web Services for providing compute time for the experiments in this paper.' This indicates a cloud provider was used but does not provide specific hardware details such as GPU models, CPU types, or instance configurations necessary to reproduce the hardware environment.
Software Dependencies | No | The paper mentions 'We trained a modified version of VGGNet (Simonyan & Zisserman, 2014) on Tensorflow.' It specifies 'Tensorflow' as the framework, but does not provide any version number for TensorFlow or any other software component used in the experiments.
Experiment Setup | Yes | We initialize the parameters according to the default configuration in Tensorflow: all the weights are initialized by the Glorot uniform initializer (Glorot & Bengio, 2010); β and γ in BN are initialized to 0 and 1, respectively. We set ϵ = 0 in each BN, since we observed that the network works equally well for ϵ being 0 or a small number (such as 10^{-3}, the default value in Tensorflow). We train the network in either setting with different learning rates ranging from 10^{-2} to 10^{2} for 100 epochs. (An illustrative TensorFlow sketch of this setup is given after the table.)
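
As a reading aid for the rates quoted in the Research Type row, here is a minimal LaTeX sketch of the kind of bound being claimed. The symbols L (training loss), θ_t (iterate at step t), and the Õ notation hiding logarithmic factors are assumptions made here for illustration; the precise theorem statements, constants, and conditions are in the paper itself.

```latex
% Hedged restatement of the quoted rates. Symbols assumed here:
% L is the training loss, \theta_t the t-th iterate, \tilde{O} hides log factors.
% Full-batch gradient descent with a fixed learning rate for the
% scale-invariant parameters (rate T^{-1/2} in T iterations):
\min_{t \le T} \left\| \nabla L(\theta_t) \right\|^2 \;=\; \tilde{O}\!\left( T^{-1/2} \right)

% Stochastic gradient descent, analogous statement (rate T^{-1/4}):
\min_{t \le T} \mathbb{E}\!\left[ \left\| \nabla L(\theta_t) \right\|^2 \right] \;=\; \tilde{O}\!\left( T^{-1/4} \right)
```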
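
Since no source code is released (see the Open Source Code row), the following is only a minimal sketch of the experiment setup quoted above, written with tf.keras: Glorot uniform kernel initialization, BN with ϵ = 0, β initialized to 0 and γ to 1, and plain SGD with one fixed learning rate from the 10^{-2} to 10^{2} sweep for 100 epochs. The layer widths, network depth, and CIFAR-10 input shape are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the quoted setup; NOT the authors' code.
# Layer widths/depth and the CIFAR-10 input shape are assumed here.
import tensorflow as tf

def conv_bn_block(x, filters):
    """One VGG-style conv + BN + ReLU block with the quoted initializations."""
    x = tf.keras.layers.Conv2D(
        filters, 3, padding="same", use_bias=False,
        kernel_initializer="glorot_uniform")(x)   # Glorot uniform weights
    x = tf.keras.layers.BatchNormalization(
        epsilon=0.0,                  # paper sets eps = 0 in each BN
        beta_initializer="zeros",     # beta initialized to 0
        gamma_initializer="ones")(x)  # gamma initialized to 1
    return tf.keras.layers.ReLU()(x)

def build_vgg_like(num_classes=10):
    """A small VGG-like network on 32x32x3 inputs (CIFAR-10 shape assumed)."""
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = inputs
    for filters in (64, 128, 256):    # assumed widths, not from the paper
        x = conv_bn_block(x, filters)
        x = conv_bn_block(x, filters)
        x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)

# Train with one fixed learning rate from the 1e-2 ... 1e2 sweep, 100 epochs.
model = build_vgg_like()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
# (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test))
```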