Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Authors: Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, Dian Jiao, Weiyan Wang, Chengjun Liu, Zheng Fang, Jinbao Xue, Yangyu Tao, Bin Cui, Di Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we derive the scaling law between batch size and optimal learning rate in the sign-of-gradient case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak of the surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law. (An illustrative sketch of this rise-then-fall relationship is given at the end of this section.)
Researcher Affiliation | Collaboration | (1) Tencent Hunyuan; (2) School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University; (3) University of Macau; (4) Institute of Computational Social Science, Peking University (Qingdao)
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]. Justification: The model structure and data required for the experiments are publicly available, and the necessary experimental configuration has been provided in Section 3.1 for reproduction.
Open Datasets | Yes | In our empirical study, we incorporate 4 open-source workloads that are extensively utilized: (1) training a 5-layer CNN model on the Fashion-MNIST [31]... (2) training a ResNet18 model [32] on the Tiny-ImageNet dataset [33]... (3) training a dense Transformer model [12] (simplified DistilGPT2 [34]) on the ELI5-Category dataset [35]... (4) training a fine-grained Mixture-of-Experts (MoE) model... on the RedPajama-v2 dataset [39]...
Dataset Splits | No | To showcase the optimal learning rate for each batch size configuration, we leverage a grid-search-style experiment set. Each point in the grid search corresponds to a training round with the same configuration but a different random seed.
Hardware Specification | Yes | We execute each round of experiments utilizing an NVIDIA A100 card.
Software Dependencies | No | We conduct experiments using the Adam optimizer.
Experiment Setup | Yes | Batch sizes and learning rates. To showcase the optimal learning rate for each batch size configuration, we leverage a grid-search-style experiment set. ... The start point, stop point, and interval for the different workloads are listed in Table 1. Hyper-parameters. Since we derive the theorems for Adam-style optimizers, we conduct experiments with the Adam optimizer. We experiment with both the "sign of gradient" configuration (β1 = 0, β2 = 0) and the default hyper-parameters (β1 = 0.9, β2 = 0.999), as shown in Table 1. (Sketches of this grid search and of the β1 = β2 = 0 sign-of-gradient configuration are given at the end of this section.)
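
The three sketches below illustrate the technical claims quoted in the table. First, the rise-then-fall relationship between batch size and optimal learning rate referenced under Research Type: the functional form eta(B) = 2*eta_max / (sqrt(B_noise/B) + sqrt(B/B_noise)) and the symbols eta_max and B_noise (a gradient-noise-scale-like quantity) are illustrative assumptions chosen to have the qualitative shape described in the quoted abstract, not the paper's stated equation. Letting B_noise grow stands in for the claim that the surge's peak moves toward larger batch sizes as training progresses.

```python
import numpy as np

def optimal_lr(batch_size, eta_max, b_noise):
    """Illustrative rise-then-fall law (assumed form, not the paper's equation):
    eta(B) = 2 * eta_max / (sqrt(B_noise/B) + sqrt(B/B_noise)), which peaks at B = B_noise."""
    ratio = np.sqrt(b_noise / batch_size) + np.sqrt(batch_size / b_noise)
    return 2.0 * eta_max / ratio

batch_sizes = np.array([2 ** k for k in range(4, 15)])  # 16 ... 16384
for b_noise in (256, 1024, 4096):  # assumed noise scale, growing as training progresses
    lrs = optimal_lr(batch_sizes, eta_max=3e-4, b_noise=b_noise)
    peak = batch_sizes[np.argmax(lrs)]
    print(f"B_noise={b_noise:5d}  peak optimal LR at batch size {peak}")
```

Running this prints a peak that shifts from batch size 256 to 1024 to 4096 as B_noise grows, mirroring the qualitative surge behaviour described above.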
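Second, the grid-search procedure quoted under Dataset Splits and Experiment Setup. Here train_and_eval is a hypothetical placeholder that returns a synthetic loss so the sketch executes; a faithful reproduction would replace it with one real training round per (batch size, learning rate, seed) triple and take the batch-size and learning-rate ranges from Table 1 of the paper.

```python
import math
import random
import statistics

def train_and_eval(batch_size, lr, seed):
    """Hypothetical placeholder for one training round.
    Returns a synthetic loss that is convex in log10(lr) so the script runs end to end;
    replace with an actual training run to reproduce the paper's grid search."""
    rng = random.Random(seed)
    return (math.log10(lr) + 3.0) ** 2 + rng.uniform(0.0, 0.05)

# Illustrative grid; the paper's start/stop/interval values live in its Table 1.
batch_sizes = [32, 64, 128, 256, 512]
learning_rates = [1e-4 * 2 ** k for k in range(8)]
seeds = [0, 1, 2]  # each grid point is repeated with different random seeds

best_lr = {}
for bs in batch_sizes:
    mean_loss = {
        lr: statistics.mean(train_and_eval(bs, lr, s) for s in seeds)
        for lr in learning_rates
    }
    best_lr[bs] = min(mean_loss, key=mean_loss.get)  # lowest mean loss defines the optimal LR

print(best_lr)
```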
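Third, the "sign of gradient" configuration: with β1 = 0 and β2 = 0, Adam's first moment is the raw gradient and its second moment is the squared gradient, so one step moves each coordinate by roughly lr · sign(g). The numerical check below uses PyTorch, which is an assumption on our part; the excerpt does not name the training framework.

```python
import torch

g = torch.tensor([0.3, -2.0, 1e-4, -1e-6])

# One Adam step with beta1 = beta2 = 0 and a tiny eps ...
p_adam = torch.zeros(4, requires_grad=True)
opt = torch.optim.Adam([p_adam], lr=0.1, betas=(0.0, 0.0), eps=1e-12)
p_adam.grad = g.clone()
opt.step()

# ... matches a sign-of-gradient step: p <- p - lr * sign(g).
p_sign = -0.1 * torch.sign(g)

print(p_adam.detach())  # approximately [-0.1, 0.1, -0.1, 0.1]
print(p_sign)           # exactly       [-0.1, 0.1, -0.1, 0.1]
```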