Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

Authors: Sanjeev Arora, Zhiyuan Li, Kaifeng Lyu

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here theoretical support is provided for one of batch normalization's (BN) conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of scale-invariant parameters (e.g., weights of each layer with BN) to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of T^{-1/2} in T iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate T^{-1/4} is also shown for stochastic gradient descent. We include some experiments in Appendix D, showing that it is indeed the auto-tuning behavior we analysed in this paper that empowers BN to have such convergence with an arbitrary learning rate for scale-invariant parameters. (See the convergence-rate sketch after this table.)
Researcher Affiliation | Academia | Sanjeev Arora, Princeton University and Institute for Advanced Study, arora@cs.princeton.edu; Zhiyuan Li, Princeton University, zhiyuanli@cs.princeton.edu; Kaifeng Lyu, Tsinghua University, lkf15@mails.tsinghua.edu.cn
Pseudocode | No | The paper provides mathematical equations for updates (e.g., Equation 4, Equation 9) but does not include any clearly labeled pseudocode blocks or algorithms.
Open Source Code | No | The paper does not provide any statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We trained a modified version of VGGNet (Simonyan & Zisserman, 2014) on Tensorflow. In this network, every kernel is scale-invariant, and for every BN layer except the last one, the concatenation of all β and γ parameters in this BN is also scale-invariant. Only β and γ parameters in the last BN are scale-variant (See Section 2.1). We consider the training in the following two settings: [...] We train the network in either setting with different learning rates ranging from 10^{-2} to 10^{2} for 100 epochs.
Dataset Splits | No | The paper mentions 'training loss' and 'test accuracy' and uses the CIFAR-10 dataset. However, it does not specify any explicit train/validation/test dataset splits (e.g., percentages or sample counts), nor does it reference predefined splits with citations for reproducibility of data partitioning.
Hardware Specification | No | The paper states: 'We thank Amazon Web Services for providing compute time for the experiments in this paper.' This indicates a cloud provider was used but does not provide specific hardware details such as GPU models, CPU types, or instance configurations necessary to reproduce the hardware environment.
Software Dependencies | No | The paper mentions 'We trained a modified version of VGGNet (Simonyan & Zisserman, 2014) on Tensorflow.' It specifies 'Tensorflow' as the framework, but does not provide any version number for TensorFlow or any other software component used in the experiments.
Experiment Setup | Yes | We initialize the parameters according to the default configuration in Tensorflow: all the weights are initialized by the Glorot uniform initializer (Glorot & Bengio, 2010); β and γ in BN are initialized to 0 and 1, respectively. We set ϵ = 0 in each BN, since we observed that the network works equally well for ϵ being 0 or a small number (such as 10^{-3}, the default value in Tensorflow). We train the network in either setting with different learning rates ranging from 10^{-2} to 10^{2} for 100 epochs. (An illustrative TensorFlow sketch of this setup is given after the table.)
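
As a reading aid for the rates quoted in the Research Type row, here is a minimal LaTeX sketch of the kind of bound being claimed. The symbols L (training loss), θ_t (iterate at step t), and the Õ notation hiding logarithmic factors are assumptions made here for illustration; the precise theorem statements, constants, and conditions are in the paper itself.

```latex
% Hedged restatement of the quoted rates. Symbols assumed here:
% L is the training loss, \theta_t the t-th iterate, \tilde{O} hides log factors.
% Full-batch gradient descent with a fixed learning rate for the
% scale-invariant parameters (rate T^{-1/2} in T iterations):
\min_{t \le T} \left\| \nabla L(\theta_t) \right\|^2 \;=\; \tilde{O}\!\left( T^{-1/2} \right)

% Stochastic gradient descent, analogous statement (rate T^{-1/4}):
\min_{t \le T} \mathbb{E}\!\left[ \left\| \nabla L(\theta_t) \right\|^2 \right] \;=\; \tilde{O}\!\left( T^{-1/4} \right)
```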
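
Since no source code is released (see the Open Source Code row), the following is only a minimal sketch of the experiment setup quoted above, written with tf.keras: Glorot uniform kernel initialization, BN with ϵ = 0, β initialized to 0 and γ to 1, and plain SGD with one fixed learning rate from the 10^{-2} to 10^{2} sweep for 100 epochs. The layer widths, network depth, and CIFAR-10 input shape are assumptions, not taken from the paper.

```python
# Hypothetical sketch of the quoted setup; NOT the authors' code.
# Layer widths/depth and the CIFAR-10 input shape are assumed here.
import tensorflow as tf

def conv_bn_block(x, filters):
    """One VGG-style conv + BN + ReLU block with the quoted initializations."""
    x = tf.keras.layers.Conv2D(
        filters, 3, padding="same", use_bias=False,
        kernel_initializer="glorot_uniform")(x)   # Glorot uniform weights
    x = tf.keras.layers.BatchNormalization(
        epsilon=0.0,                  # paper sets eps = 0 in each BN
        beta_initializer="zeros",     # beta initialized to 0
        gamma_initializer="ones")(x)  # gamma initialized to 1
    return tf.keras.layers.ReLU()(x)

def build_vgg_like(num_classes=10):
    """A small VGG-like network on 32x32x3 inputs (CIFAR-10 shape assumed)."""
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = inputs
    for filters in (64, 128, 256):    # assumed widths, not from the paper
        x = conv_bn_block(x, filters)
        x = conv_bn_block(x, filters)
        x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Flatten()(x)
    outputs = tf.keras.layers.Dense(num_classes)(x)
    return tf.keras.Model(inputs, outputs)

# Train with one fixed learning rate from the 1e-2 ... 1e2 sweep, 100 epochs.
model = build_vgg_like()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
# (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test))
```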