Maximal Initial Learning Rates in Deep ReLU Networks
Authors: Gaurav Iyer, Boris Hanin, David Rolnick
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we consider both empirically and theoretically how large the learning rate can be in early training. Our main contributions are as follows: ... We empirically identify a power law ... We formally prove bounds for λ1 in terms of (depth × width) that align with our empirical results. |
| Researcher Affiliation | Academia | 1 School of Computer Science, McGill University, Montreal, Canada; 2 Mila Quebec AI Institute, Montreal, Canada; 3 Dept. of Operations Research & Financial Engineering, Princeton University, Princeton, USA. |
| Pseudocode | Yes | Algorithm 1 Maximal Initial Learning Rate η |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Namely, for MNIST and CIFAR-10 we use t = 0.925 and t = 0.34 respectively. Data is sampled from two multivariate normal distributions (i.e. binary classification). The training set and validation set respectively consist of 9k and 1k samples from each distribution, leading to a total of 20k samples (with 18k samples in the training set, and 2k in the validation set). |
| Dataset Splits | Yes | Define threshold accuracy t... Evaluate validation accuracy a. If a >= t, then break out of inner loop, and l = m. ...The training set and validation set respectively consist of 9k and 1k samples from each distribution, leading to a total of 20k samples (with 18k samples in the training set, and 2k in the validation set). |
| Hardware Specification | No | The acknowledgments state: 'The authors acknowledge material support from NVIDIA and Intel in the form of computational resources and are grateful for technical support from the Mila IDT team in maintaining the Mila Compute Cluster.' This mentions vendors and a cluster, but lacks specific hardware model numbers or detailed specifications. |
| Software Dependencies | No | The paper mentions using 'PyHessian (Yao et al., 2020)' for computing sharpness, but does not provide a specific version number for PyHessian or any other software dependencies such as programming languages or frameworks. |
| Experiment Setup | Yes | We primarily focus on constant width, fully-connected deep ReLU networks trained with SGD, that are initialized with the Kaiming normal initialization scheme. The batch size is set to 128 across all our experiments. When using Algorithm 1, we set threshold accuracy t ... along with the number of training epochs e = 10. Upper and lower learning rate limits u and l are set heuristically; we use l = 0.0 for all our experiments, and find that s = 5 search iterations are sufficient in practice to calculate η. (This setup and Algorithm 1 are sketched in code below the table.) |
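The Pseudocode, Dataset Splits, and Experiment Setup rows together outline Algorithm 1: a bisection search over learning rates in which a candidate rate counts as trainable if a freshly initialized network reaches threshold validation accuracy t within e = 10 epochs. Below is a minimal PyTorch sketch of that search under the quoted hyperparameters (l = 0.0, s = 5 search iterations, SGD, Kaiming normal initialization). The helper names, the upper limit u, and the loss/architecture details are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


def build_mlp(in_dim, width, depth, num_classes=2):
    """Constant-width, fully-connected ReLU network with Kaiming normal
    initialization, as described in the Experiment Setup row."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, num_classes))
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)
            nn.init.zeros_(m.bias)
    return net


def validation_accuracy(net, val_loader):
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (net(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


def maximal_initial_learning_rate(make_net, train_loader, val_loader,
                                  t, l=0.0, u=8.0, s=5, epochs=10):
    """Bisection search for the maximal initial learning rate eta (a sketch
    of Algorithm 1). A candidate rate m counts as trainable if SGD at rate m
    reaches validation accuracy t within `epochs` epochs."""
    for _ in range(s):                          # s = 5 search iterations
        m = (l + u) / 2.0                       # candidate learning rate
        net = make_net()
        opt = torch.optim.SGD(net.parameters(), lr=m)
        loss_fn = nn.CrossEntropyLoss()
        reached = False
        for _ in range(epochs):                 # e = 10 training epochs
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(net(x), y).backward()
                opt.step()
            if validation_accuracy(net, val_loader) >= t:
                reached = True                  # break out of inner loop
                break
        if reached:
            l = m                               # m still trains: search higher
        else:
            u = m                               # m fails: search lower
    return l                                    # estimate of eta
```

After s iterations, l is the largest candidate rate found to reach the threshold accuracy, so the maximal initial learning rate is bracketed between the final l and u.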
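The Open Datasets and Dataset Splits rows also quote a synthetic setup: binary classification data drawn from two multivariate normal distributions, with 9k training and 1k validation samples per class (18k/2k overall). The following sketch shows how such data might be generated and batched; the input dimension, means, and covariances are assumptions, since the quoted text does not specify them — only the split sizes and the batch size of 128 come from the table above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_gaussian_binary_data(dim=10, n_train=9000, n_val=1000, shift=1.0):
    """Two-class data from two multivariate normals. Only the split sizes
    (9k train / 1k val per class) come from the quoted text; dimension,
    means, and covariances here are illustrative assumptions."""
    def sample(n, mean):
        return torch.randn(n, dim) + mean       # unit-covariance Gaussian

    mean0, mean1 = torch.zeros(dim), torch.full((dim,), shift)
    x_train = torch.cat([sample(n_train, mean0), sample(n_train, mean1)])
    y_train = torch.cat([torch.zeros(n_train, dtype=torch.long),
                         torch.ones(n_train, dtype=torch.long)])
    x_val = torch.cat([sample(n_val, mean0), sample(n_val, mean1)])
    y_val = torch.cat([torch.zeros(n_val, dtype=torch.long),
                       torch.ones(n_val, dtype=torch.long)])
    return (x_train, y_train), (x_val, y_val)


# Batch size 128 across all experiments, per the Experiment Setup row.
(x_tr, y_tr), (x_va, y_va) = make_gaussian_binary_data()
train_loader = DataLoader(TensorDataset(x_tr, y_tr), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(x_va, y_va), batch_size=128)
```

These loaders can then be passed to `maximal_initial_learning_rate` from the sketch above.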