Where Do Large Learning Rates Lead Us?
Authors: Ildus Sadrtdinov, Maxim Kodryan, Eduard Pokonechny, Ekaterina Lobacheva, Dmitry P. Vetrov
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? |
| Researcher Affiliation | Collaboration | Ildus Sadrtdinov (1,2), Maxim Kodryan (2), Eduard Pokonechny (3), Ekaterina Lobacheva (3), Dmitry Vetrov (1); affiliations: 1 Constructor University, Bremen; 2 HSE University; 3 Independent researcher |
| Pseudocode | No | The paper describes methods in text and figures, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/isadrtdinov/understanding-large-lrs. |
| Open Datasets | Yes | All results are obtained with a scale-invariant ResNet-18 [23] trained on CIFAR-10 [37]. We additionally consider a plain convolutional network ConvNet and ViT small architecture [16] as well as CIFAR-100 and Tiny ImageNet [39] datasets in the appendix. |
| Dataset Splits | No | For the synthetic example, the paper states 'We take 512 training and 2000 testing samples.' It does not explicitly specify validation splits for any dataset, nor does it provide the exact train/test splits for CIFAR-10, CIFAR-100, or Tiny ImageNet. |
| Hardware Specification | Yes | We use NVIDIA TESLA V100 and A100 GPUs for computations in our experiments. |
| Software Dependencies | No | The paper provides links to code repositories, but it does not explicitly list software dependencies with version numbers (e.g., 'PyTorch 1.9', 'CUDA 11.1'). |
| Experiment Setup | Yes | We train all networks using SGD with a batch size of 128. Both the pre-training and the fine-tuning stages take 200 epochs... In the practical setting in Section 6, we use weight decay of 5·10⁻⁴, momentum of 0.9, and standard augmentations: random crops (size: 32 for CIFAR and 64 for Tiny ImageNet, padding: 4), random horizontal flips, and per-channel normalization. |
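
The last row quotes the paper's training configuration; below is a minimal PyTorch sketch of that setup, assuming standard torchvision components. The learning-rate value, the CIFAR-10 normalization statistics, and the plain `resnet18` stand-in are illustrative assumptions only: the paper sweeps a range of initial LRs and trains a scale-invariant ResNet-18 (see the linked repository for the exact model).

```python
# Minimal sketch of the quoted setup: SGD, batch size 128, momentum 0.9,
# weight decay 5e-4, 200 epochs, random 32x32 crops (padding 4), horizontal
# flips, and per-channel normalization on CIFAR-10.
import torch
from torch import nn, optim
from torchvision import datasets, transforms
from torchvision.models import resnet18

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),            # crop size 32 for CIFAR (64 for Tiny ImageNet)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # assumed standard CIFAR-10 statistics,
                         (0.2470, 0.2435, 0.2616)),  # not stated in the paper
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128,
                                           shuffle=True, num_workers=4)

# Plain torchvision ResNet-18 as a stand-in; the paper uses a scale-invariant variant.
model = resnet18(num_classes=10)
optimizer = optim.SGD(model.parameters(),
                      lr=0.1,                        # placeholder; the paper studies different initial LRs
                      momentum=0.9,
                      weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):                             # 200 epochs per stage (pre-training / fine-tuning)
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```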