Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling

Authors: Shuaipeng Li, Penghao Zhao, Hailin Zhang, Xingwu Sun, Hao Wu, Dian Jiao, Weiyan Wang, Chengjun Liu, Zheng Fang, Jinbao Xue, Yangyu Tao, Bin Cui, Di Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we derive the scaling law between batch size and optimal learning rate in the sign-of-gradient case, in which we prove that the optimal learning rate first rises and then falls as the batch size increases. Moreover, the peak of the surge gradually moves toward larger batch sizes as training progresses. Second, we conduct experiments on various CV and NLP tasks and verify the correctness of the scaling law. (An illustrative sketch of this rise-then-fall relationship is given at the end of this section.)
Researcher Affiliation | Collaboration | (1) Tencent Hunyuan; (2) School of Computer Science & Key Lab of High Confidence Software Technologies (MOE), Peking University; (3) University of Macau; (4) Institute of Computational Social Science, Peking University (Qingdao)
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [NA]. Justification: The model structure and data required for the experiments are publicly available, and the necessary experimental configuration has been provided in Section 3.1 for reproduction.
Open Datasets | Yes | In our empirical study, we incorporate 4 open-source workloads that are extensively utilized: (1) training a 5-layer CNN model on the Fashion-MNIST [31]... (2) training a ResNet18 model [32] on the Tiny-ImageNet dataset [33]... (3) training a dense Transformer model [12] (simplified DistilGPT2 [34]) on the ELI5-Category dataset [35]... (4) training a fine-grained Mixture-of-Experts (MoE) model... on the RedPajama-v2 dataset [39]...
Dataset Splits | No | To showcase the optimal learning rate for each batch size configuration, we leverage a grid-search-style experiment set. Each point in the grid search corresponds to a training round with the same configuration but a different random seed.
Hardware Specification | Yes | We execute each round of experiments utilizing an NVIDIA A100 card.
Software Dependencies | No | We conduct experiments using the Adam optimizer.
Experiment Setup | Yes | Batch sizes and learning rates. To showcase the optimal learning rate for each batch size configuration, we leverage a grid-search-style experiment set. ... The start point, stop point, and interval for the different workloads are listed in Table 1. Hyper-parameters. Since we derive the theorems for Adam-style optimizers, we conduct experiments with the Adam optimizer. We experiment with both the "sign of gradient" configuration (β1 = 0, β2 = 0) and the default hyper-parameters (β1 = 0.9, β2 = 0.999), as shown in Table 1. (Sketches of this grid search and of the β1 = β2 = 0 sign-of-gradient configuration are given at the end of this section.)
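
The three sketches below illustrate the technical claims quoted in the table. First, the rise-then-fall relationship between batch size and optimal learning rate referenced under Research Type: the functional form eta(B) = 2*eta_max / (sqrt(B_noise/B) + sqrt(B/B_noise)) and the symbols eta_max and B_noise (a gradient-noise-scale-like quantity) are illustrative assumptions chosen to have the qualitative shape described in the quoted abstract, not the paper's stated equation. Letting B_noise grow stands in for the claim that the surge's peak moves toward larger batch sizes as training progresses.

```python
import numpy as np

def optimal_lr(batch_size, eta_max, b_noise):
    """Illustrative rise-then-fall law (assumed form, not the paper's equation):
    eta(B) = 2 * eta_max / (sqrt(B_noise/B) + sqrt(B/B_noise)), which peaks at B = B_noise."""
    ratio = np.sqrt(b_noise / batch_size) + np.sqrt(batch_size / b_noise)
    return 2.0 * eta_max / ratio

batch_sizes = np.array([2 ** k for k in range(4, 15)])  # 16 ... 16384
for b_noise in (256, 1024, 4096):  # assumed noise scale, growing as training progresses
    lrs = optimal_lr(batch_sizes, eta_max=3e-4, b_noise=b_noise)
    peak = batch_sizes[np.argmax(lrs)]
    print(f"B_noise={b_noise:5d}  peak optimal LR at batch size {peak}")
```

Running this prints a peak that shifts from batch size 256 to 1024 to 4096 as B_noise grows, mirroring the qualitative surge behaviour described above.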
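Second, the grid-search procedure quoted under Dataset Splits and Experiment Setup. Here train_and_eval is a hypothetical placeholder that returns a synthetic loss so the sketch executes; a faithful reproduction would replace it with one real training round per (batch size, learning rate, seed) triple and take the batch-size and learning-rate ranges from Table 1 of the paper.

```python
import math
import random
import statistics

def train_and_eval(batch_size, lr, seed):
    """Hypothetical placeholder for one training round.
    Returns a synthetic loss that is convex in log10(lr) so the script runs end to end;
    replace with an actual training run to reproduce the paper's grid search."""
    rng = random.Random(seed)
    return (math.log10(lr) + 3.0) ** 2 + rng.uniform(0.0, 0.05)

# Illustrative grid; the paper's start/stop/interval values live in its Table 1.
batch_sizes = [32, 64, 128, 256, 512]
learning_rates = [1e-4 * 2 ** k for k in range(8)]
seeds = [0, 1, 2]  # each grid point is repeated with different random seeds

best_lr = {}
for bs in batch_sizes:
    mean_loss = {
        lr: statistics.mean(train_and_eval(bs, lr, s) for s in seeds)
        for lr in learning_rates
    }
    best_lr[bs] = min(mean_loss, key=mean_loss.get)  # lowest mean loss defines the optimal LR

print(best_lr)
```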
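Third, the "sign of gradient" configuration: with β1 = 0 and β2 = 0, Adam's first moment is the raw gradient and its second moment is the squared gradient, so one step moves each coordinate by roughly lr · sign(g). The numerical check below uses PyTorch, which is an assumption on our part; the excerpt does not name the training framework.

```python
import torch

g = torch.tensor([0.3, -2.0, 1e-4, -1e-6])

# One Adam step with beta1 = beta2 = 0 and a tiny eps ...
p_adam = torch.zeros(4, requires_grad=True)
opt = torch.optim.Adam([p_adam], lr=0.1, betas=(0.0, 0.0), eps=1e-12)
p_adam.grad = g.clone()
opt.step()

# ... matches a sign-of-gradient step: p <- p - lr * sign(g).
p_sign = -0.1 * torch.sign(g)

print(p_adam.detach())  # approximately [-0.1, 0.1, -0.1, 0.1]
print(p_sign)           # exactly       [-0.1, 0.1, -0.1, 0.1]
```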