Maximal Initial Learning Rates in Deep ReLU Networks
Authors: Gaurav Iyer, Boris Hanin, David Rolnick
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we consider both empirically and theoretically how large the learning rate can be in early training. Our main contributions are as follows: ... We empirically identify a power law ... We formally prove bounds for λ1 in terms of (depth × width) that align with our empirical results. |
| Researcher Affiliation | Academia | 1 School of Computer Science, McGill University, Montreal, Canada; 2 Mila Quebec AI Institute, Montreal, Canada; 3 Dept. of Operations Research & Financial Engineering, Princeton University, Princeton, USA. |
| Pseudocode | Yes | Algorithm 1 Maximal Initial Learning Rate η |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Namely, for MNIST and CIFAR-10 we use t = 0.925 and t = 0.34 respectively. Data is sampled from two multivariate normal distributions (i.e. binary classification). The training set and validation set respectively consist of 9k and 1k samples from each distribution, leading to a total of 20k samples (with 18k samples in the training set, and 2k in the validation set). |
| Dataset Splits | Yes | Define threshold accuracy t... Evaluate validation accuracy a. If a >= t, then break out of inner loop, and l = m. ...The training set and validation set respectively consist of 9k and 1k samples from each distribution, leading to a total of 20k samples (with 18k samples in the training set, and 2k in the validation set). |
| Hardware Specification | No | The acknowledgments state: 'The authors acknowledge material support from NVIDIA and Intel in the form of computational resources and are grateful for technical support from the Mila IDT team in maintaining the Mila Compute Cluster.' This mentions vendors and a cluster, but lacks specific hardware model numbers or detailed specifications. |
| Software Dependencies | No | The paper mentions using 'PyHessian (Yao et al., 2020)' for computing sharpness, but does not provide a specific version number for PyHessian or any other software dependencies such as programming languages or frameworks. |
| Experiment Setup | Yes | We primarily focus on constant width, fully-connected deep ReLU networks trained with SGD, that are initialized with the Kaiming normal initialization scheme. The batch size is set to 128 across all our experiments. When using Algorithm 1, we set threshold accuracy t ... along with the number of training epochs e = 10. Upper and lower learning rate limits u and l are set heuristically; we use l = 0.0 for all our experiments, and find that s = 5 search iterations are sufficient in practice to calculate η. (This setup and Algorithm 1 are sketched in code below the table.) |
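The Pseudocode, Dataset Splits, and Experiment Setup rows together outline Algorithm 1: a bisection search over learning rates in which a candidate rate counts as trainable if a freshly initialized network reaches threshold validation accuracy t within e = 10 epochs. Below is a minimal PyTorch sketch of that search under the quoted hyperparameters (l = 0.0, s = 5 search iterations, SGD, Kaiming normal initialization). The helper names, the upper limit u, and the loss/architecture details are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


def build_mlp(in_dim, width, depth, num_classes=2):
    """Constant-width, fully-connected ReLU network with Kaiming normal
    initialization, as described in the Experiment Setup row."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, num_classes))
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)
            nn.init.zeros_(m.bias)
    return net


def validation_accuracy(net, val_loader):
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (net(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


def maximal_initial_learning_rate(make_net, train_loader, val_loader,
                                  t, l=0.0, u=8.0, s=5, epochs=10):
    """Bisection search for the maximal initial learning rate eta (a sketch
    of Algorithm 1). A candidate rate m counts as trainable if SGD at rate m
    reaches validation accuracy t within `epochs` epochs."""
    for _ in range(s):                          # s = 5 search iterations
        m = (l + u) / 2.0                       # candidate learning rate
        net = make_net()
        opt = torch.optim.SGD(net.parameters(), lr=m)
        loss_fn = nn.CrossEntropyLoss()
        reached = False
        for _ in range(epochs):                 # e = 10 training epochs
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(net(x), y).backward()
                opt.step()
            if validation_accuracy(net, val_loader) >= t:
                reached = True                  # break out of inner loop
                break
        if reached:
            l = m                               # m still trains: search higher
        else:
            u = m                               # m fails: search lower
    return l                                    # estimate of eta
```

After s iterations, l is the largest candidate rate found to reach the threshold accuracy, so the maximal initial learning rate is bracketed between the final l and u.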
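The Open Datasets and Dataset Splits rows also quote a synthetic setup: binary classification data drawn from two multivariate normal distributions, with 9k training and 1k validation samples per class (18k/2k overall). The following sketch shows how such data might be generated and batched; the input dimension, means, and covariances are assumptions, since the quoted text does not specify them — only the split sizes and the batch size of 128 come from the table above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_gaussian_binary_data(dim=10, n_train=9000, n_val=1000, shift=1.0):
    """Two-class data from two multivariate normals. Only the split sizes
    (9k train / 1k val per class) come from the quoted text; dimension,
    means, and covariances here are illustrative assumptions."""
    def sample(n, mean):
        return torch.randn(n, dim) + mean       # unit-covariance Gaussian

    mean0, mean1 = torch.zeros(dim), torch.full((dim,), shift)
    x_train = torch.cat([sample(n_train, mean0), sample(n_train, mean1)])
    y_train = torch.cat([torch.zeros(n_train, dtype=torch.long),
                         torch.ones(n_train, dtype=torch.long)])
    x_val = torch.cat([sample(n_val, mean0), sample(n_val, mean1)])
    y_val = torch.cat([torch.zeros(n_val, dtype=torch.long),
                       torch.ones(n_val, dtype=torch.long)])
    return (x_train, y_train), (x_val, y_val)


# Batch size 128 across all experiments, per the Experiment Setup row.
(x_tr, y_tr), (x_va, y_va) = make_gaussian_binary_data()
train_loader = DataLoader(TensorDataset(x_tr, y_tr), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(x_va, y_va), batch_size=128)
```

These loaders can then be passed to `maximal_initial_learning_rate` from the sketch above.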