Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Authors: Zhiyuan Li, Kaifeng Lyu, Sanjeev Arora

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | New experiments, backed by mathematical intuition, suggest that the number of steps to equilibrium (in function space) scales as the inverse of the intrinsic learning rate, in contrast to the exponential-time convergence bound implied by SDE analysis. The authors name this the Fast Equilibrium Conjecture and suggest it holds the key to why Batch Normalization is effective. (A compact statement of the conjectured scaling appears after the table.)
Researcher Affiliation | Academia | Zhiyuan Li (Princeton University, zhiyuanli@cs.princeton.edu); Kaifeng Lyu (Tsinghua University, vfleaking@gmail.com); Sanjeev Arora (Princeton University & IAS, arora@cs.princeton.edu)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | The paper uses the publicly available MNIST dataset: "We use a simple 4-layer CNN for MNIST." (One plausible instantiation of such a model is sketched after the table.)
Dataset Splits | No | The paper mentions using the MNIST and CIFAR-10 datasets and discusses train/test errors, but it does not specify the exact split percentages, sample counts, or a detailed splitting methodology for training, validation, and test sets, as would be needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | The initial learning rate (LR) is 0.1 and the initial weight-decay (WD) factor is 0.0005. A label of the form wd_x_y_lr_z_u means the WD factor is divided by 10 at epochs x and y, and the LR is divided by 10 at epochs z and u. For example, the blue line divides the LR by 10 twice at epoch 0, i.e., it starts with an LR of 0.001, and then divides the LR by 10 again at epoch 5000. (A minimal sketch of this labeling convention follows the table.)
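
For context on the Research Type row, the paper's central claim contrasts two time scales. The sketch below assumes the paper's standard setup, SGD with learning rate \eta and weight-decay factor \lambda on a scale-invariant (batch-normalized) network, with the intrinsic learning rate defined as their product; the notation is a summary, not a quotation.

```latex
% Assumed notation: \eta = SGD learning rate, \lambda = weight-decay
% factor; the intrinsic learning rate is their product.
\lambda_e \;=\; \eta \lambda
% Conjectured vs. classical scaling of the number of steps needed to
% reach equilibrium in function space:
T_{\text{conjectured}} \;=\; O\!\left(1/\lambda_e\right)
\qquad \text{vs.} \qquad
T_{\text{SDE-based bound}} \;=\; e^{O(1/\lambda_e)}
```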
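The Open Datasets row quotes the paper's description of the MNIST model only as "a simple 4-layer CNN," with no further architectural detail. The sketch below is one plausible instantiation, not the authors' model: the channel counts, kernel sizes, and pooling choices are assumptions, and BatchNorm layers are included only because normalized networks are the paper's setting.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Hypothetical 'simple 4-layer CNN' for MNIST (3 conv + 1 linear).
    All concrete choices here are illustrative assumptions."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),  # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Smoke test on a dummy MNIST-shaped batch.
out = SimpleCNN()(torch.zeros(2, 1, 28, 28))
assert out.shape == (2, 10)
```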
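The wd_x_y_lr_z_u labeling in the Experiment Setup row can be made concrete with a small helper. This is a minimal sketch under the conventions stated in that row, not the authors' code (which is not released); the function name make_schedule and the label regex are illustrative.

```python
import re

def make_schedule(label, init_lr=0.1, init_wd=0.0005):
    """Return a function epoch -> (lr, wd) for a label like
    'wd_100_200_lr_300_400': divide WD by 10 at epochs 100 and 200,
    and divide LR by 10 at epochs 300 and 400."""
    m = re.fullmatch(r"wd_(\d+)_(\d+)_lr_(\d+)_(\d+)", label)
    if m is None:
        raise ValueError(f"unrecognized label: {label!r}")
    wd_drops = [int(m.group(1)), int(m.group(2))]
    lr_drops = [int(m.group(3)), int(m.group(4))]

    def at_epoch(epoch):
        # Each drop epoch that has been reached contributes one division
        # by 10; a repeated epoch (e.g. 'lr_0_0') divides by 10 twice.
        lr = init_lr / 10 ** sum(epoch >= e for e in lr_drops)
        wd = init_wd / 10 ** sum(epoch >= e for e in wd_drops)
        return lr, wd

    return at_epoch

# Example: LR divided by 10 twice at epoch 0 (so training starts at
# LR = 0.001); hypothetical WD drops at epochs 3000 and 6000.
schedule = make_schedule("wd_3000_6000_lr_0_0")
print(schedule(0))     # (0.001, 0.0005)
print(schedule(3000))  # (0.001, 5e-05)
```

Repeating an epoch in the label (e.g. lr_0_0) applies the division twice at that epoch, which reproduces the "initial LR of 0.001" reading in the Experiment Setup row.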