Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling
Authors: Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, Dian Jiao, Weiyan Wang, Chengjun Liu, Zheng Fang, Jinbao Xue, Yangyu Tao, Bin CUI, Di Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we derive the scaling law between batch sizes and optimal learning rates in the sign-of-gradient case, proving that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak of the surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law. (A qualitative illustration of this surge-shaped curve is sketched below the table.) |
| Researcher Affiliation | Collaboration | 1 Tencent Hunyuan 2 School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University 3 University of Macau 4 Institute of Computational Social Science, Peking University (Qingdao) |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA] Justification: The model structure and data required for the experiments are publicly available. And the necessary experimental configuration has been provided in Section 3.1 for reproduction. |
| Open Datasets | Yes | In our empirical study, we incorporate 4 open-source workloads that are extensively utilized: (1) training a 5-layer CNN model on the Fashion-MNIST [31]... (2) training a ResNet18 model [32] on the Tiny-ImageNet dataset [33]... (3) training a dense Transformer model [12] (simplified DistilGPT2 [34]) on the ELI5-Category dataset [35]... (4) training a fine-grained Mixture-of-Experts (MoE) model... on the RedPajama-v2 dataset [39]... |
| Dataset Splits | No | To showcase the optimal learning rate for each batch size configuration, we leverage a grid-search-style experiment set. Each point in the grid search corresponds to a certain round with the same configuration but a different random seed. |
| Hardware Specification | Yes | We execute each round of experiments utilizing an NVIDIA A100 card. |
| Software Dependencies | No | We conduct experiments using the Adam optimizer. |
| Experiment Setup | Yes | Batch sizes and learning rates. To showcase the optimal learning rate for each batch size configuration, we leverage a grid-search-style experiment set. ... The start point, stop point, and interval of different workloads are listed in Table 1. Hyper-parameters. Since we derive the theorems on Adam-style optimizers, we conduct experiments using the Adam optimizer. We experiment on both the "sign of gradient" configuration (β1 = 0, β2 = 0) and the default hyper-parameters (β1 = 0.9, β2 = 0.999), as shown in Table 1. (A minimal sketch of this grid-search protocol follows the table.) |
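
To visualize the qualitative claim quoted in the Research Type row, the sketch below plots an optimal-learning-rate curve that first rises and then falls with batch size, with its peak drifting toward larger batch sizes as a noise scale grows during training. The functional form `eta_opt(B) = eta_max * 2*sqrt(B/B_noise) / (1 + B/B_noise)`, the `eta_max` value, and the batch-size range are illustrative assumptions for visualization only, not formulas or numbers taken from the paper.

```python
# Illustrative only: a surge-shaped curve eta_opt(B) that rises, peaks near a noise
# scale B_noise, and falls again, with the peak moving to larger batch sizes as
# B_noise grows over training. The functional form is an assumption, not the paper's law.
import numpy as np

def eta_opt(B: np.ndarray, eta_max: float, B_noise: float) -> np.ndarray:
    # Peaks at B = B_noise (value eta_max) and decays on both sides.
    r = B / B_noise
    return eta_max * 2.0 * np.sqrt(r) / (1.0 + r)

B = np.logspace(2, 7, 500)               # batch sizes on a log grid (illustrative range)
for B_noise in (1e3, 1e4, 1e5):          # a noise scale that grows as training progresses
    curve = eta_opt(B, eta_max=3e-4, B_noise=B_noise)
    peak_B = B[np.argmax(curve)]
    print(f"B_noise={B_noise:.0e}: optimal lr peaks near B={peak_B:.1e}")
```

Under this toy form the peak batch size tracks `B_noise`, which matches the quoted claim that the surge's peak shifts toward larger batch sizes as training progresses.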
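
The Experiment Setup row describes a grid search over batch sizes and learning rates with the Adam optimizer, including a "sign of gradient" configuration (β1 = 0, β2 = 0). The following sketch is a hypothetical, minimal rendering of that protocol on synthetic data; `train_once`, the toy MLP, and the grid values are placeholders, not the paper's workloads or the settings from its Table 1.

```python
# A minimal sketch (not the authors' code) of the grid-search protocol: for each batch
# size, sweep learning rates with Adam and keep the one reaching the lowest training loss.
# The toy model and synthetic data are hypothetical stand-ins for the paper's workloads.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(4096, 784)              # stand-in for flattened 28x28 images
y = torch.randint(0, 10, (4096,))       # stand-in for class labels

def train_once(batch_size: int, lr: float, steps: int = 200,
               sign_of_gradient: bool = True) -> float:
    """Train a small MLP for `steps` updates; return the mean loss of the last 20 steps."""
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
    # betas=(0, 0) approximates the "sign of gradient" setting quoted above;
    # the default Adam configuration would be betas=(0.9, 0.999).
    betas = (0.0, 0.0) if sign_of_gradient else (0.9, 0.999)
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=betas)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    losses, it = [], iter(loader)
    for _ in range(steps):
        try:
            xb, yb = next(it)
        except StopIteration:           # cycle the loader for a fixed step budget
            it = iter(loader)
            xb, yb = next(it)
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return sum(losses[-20:]) / 20

# Grid search: the start/stop/interval values here are illustrative, not Table 1's.
batch_sizes = [32, 64, 128, 256, 512]
learning_rates = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
for B in batch_sizes:
    best_lr, best_loss = min(
        ((lr, train_once(B, lr)) for lr in learning_rates), key=lambda t: t[1]
    )
    print(f"batch size {B:4d}: best lr {best_lr:.0e} (loss {best_loss:.3f})")
```

For the paper's actual workloads, the same loop would wrap the CNN / ResNet18 / DistilGPT2 / MoE training runs, with each grid's start, stop, and interval taken from the paper's Table 1 and each point repeated across random seeds, as described in the Dataset Splits and Experiment Setup rows.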