Where Do Large Learning Rates Lead Us?
Authors: Ildus Sadrtdinov, Maxim Kodryan, Eduard Pokonechny, Ekaterina Lobacheva, Dmitry P. Vetrov
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? |
| Researcher Affiliation | Collaboration | Ildus Sadrtdinov (1,2), Maxim Kodryan (2), Eduard Pokonechny (3), Ekaterina Lobacheva (3), Dmitry Vetrov (1); affiliations: 1 Constructor University, Bremen; 2 HSE University; 3 Independent researcher |
| Pseudocode | No | The paper describes methods in text and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/isadrtdinov/understanding-large-lrs. |
| Open Datasets | Yes | All results are obtained with a scale-invariant ResNet-18 [23] trained on CIFAR-10 [37]. We additionally consider a plain convolutional network ConvNet and ViT small architecture [16] as well as CIFAR-100 and Tiny ImageNet [39] datasets in the appendix. |
| Dataset Splits | No | For the synthetic example, the paper states 'We take 512 training and 2000 testing samples.' It does not explicitly specify validation splits for any dataset, nor does it provide the exact train/test splits for CIFAR-10, CIFAR-100, or Tiny ImageNet. |
| Hardware Specification | Yes | We use NVIDIA TESLA V100 and A100 GPUs for computations in our experiments. |
| Software Dependencies | No | The paper provides links to code repositories, but it does not explicitly list software dependencies with version numbers (e.g., 'PyTorch 1.9', 'CUDA 11.1'). |
| Experiment Setup | Yes | We train all networks using SGD with a batch size of 128. Both the pre-training and the fine-tuning stages take 200 epochs... In the practical setting in Section 6, we use weight decay of 5·10⁻⁴, momentum of 0.9, and standard augmentations: random crops (size: 32 for CIFAR and 64 for Tiny ImageNet, padding: 4), random horizontal flips, and per-channel normalization. |
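
The last row quotes the paper's training configuration; below is a minimal PyTorch sketch of that setup, assuming standard torchvision components. The learning-rate value, the CIFAR-10 normalization statistics, and the plain `resnet18` stand-in are illustrative assumptions only: the paper sweeps a range of initial LRs and trains a scale-invariant ResNet-18 (see the linked repository for the exact model).

```python
# Minimal sketch of the quoted setup: SGD, batch size 128, momentum 0.9,
# weight decay 5e-4, 200 epochs, random 32x32 crops (padding 4), horizontal
# flips, and per-channel normalization on CIFAR-10.
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torchvision.models import resnet18

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),            # crop size 32 for CIFAR (64 for Tiny ImageNet)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # assumed standard CIFAR-10 statistics,
                         (0.2470, 0.2435, 0.2616)),  # not stated in the paper
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=4)

# Plain torchvision ResNet-18 as a stand-in; the paper uses a scale-invariant variant.
model = resnet18(num_classes=10)
optimizer = optim.SGD(model.parameters(),
                      lr=0.1,                        # placeholder; the paper studies different initial LRs
                      momentum=0.9,
                      weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):                             # 200 epochs per stage (pre-training / fine-tuning)
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```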