Don't Decay the Learning Rate, Increase the Batch Size
Authors: Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In section 5.1, we demonstrate that decreasing the learning rate and increasing the batch size during training are equivalent. In section 5.2, we show we can further reduce the number of parameter updates by increasing the effective learning rate and scaling the batch size. In section 5.3 we apply our insights to train Inception-ResNet-V2 on ImageNet, using vast batches of up to 65536 images. Finally in section 5.4, we train ResNet-50 to 76.1% ImageNet validation accuracy within 30 minutes. |
| Researcher Affiliation | Industry | Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying & Quoc V. Le Google Brain {slsmith, pikinder, chrisying, qvl}@google.com |
| Pseudocode | No | The paper describes mathematical formulations and experimental procedures in narrative text, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We train ResNet-50 on ImageNet to 76.1% validation accuracy in under 30 minutes. Our first experiments are performed on CIFAR-10, using a 16-4 wide ResNet architecture, following the implementation of Zagoruyko & Komodakis (2016). |
| Dataset Splits | No | The paper mentions training on CIFAR-10 (50000 training images) and ImageNet, and reports 'validation accuracy' and 'test set accuracy', implying standard splits for these public datasets. However, it does not explicitly state the percentages or sample counts for training, validation, and test dataset splits. |
| Hardware Specification | Yes | To confirm that increasing the batch size during training can reduce model training times, we replicated the set-up described by Goyal et al. (2017) on a half TPU pod, comprising 256 tensorcores (Jouppi et al., 2017). |
| Software Dependencies | No | The paper states 'Using TensorFlow' but does not specify a version number for TensorFlow or any other software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We use ghost batch norm (Hoffer et al., 2017), with a ghost batch size of 128. The original training schedule follows the implementation of Zagoruyko & Komodakis (2016), using an initial learning rate of 0.1 which decays by a factor of 5 at each step, a momentum coefficient of 0.9, and a batch size of 128. (Many more specific values are given throughout Section 5; a minimal schedule sketch follows the table.) |
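
To make the paper's central recipe concrete, below is a minimal sketch (not the authors' code) of swapping a step-wise learning-rate decay for an equivalent step-wise batch-size increase, using the CIFAR-10 hyperparameters quoted above (initial learning rate 0.1, decay factor 5, batch size 128). The epoch boundaries, the `max_batch` cap, and the function names are illustrative assumptions, not values taken from the paper.

```python
# Sketch only: compares a conventional LR-decay schedule with the paper's
# alternative of growing the batch size by the same factor at each step,
# falling back to LR decay once the batch size saturates.

DECAY_FACTOR = 5
BOUNDARIES = [60, 120, 160]  # assumed epochs at which the schedule "steps"

def lr_decay_schedule(epoch, base_lr=0.1, batch_size=128):
    """Conventional schedule: learning rate decays, batch size stays fixed."""
    steps = sum(epoch >= b for b in BOUNDARIES)
    return base_lr / DECAY_FACTOR ** steps, batch_size

def batch_size_increase_schedule(epoch, base_lr=0.1, base_batch=128,
                                 max_batch=5120):
    """Equivalent schedule in the spirit of the paper: hold the learning rate
    fixed and multiply the batch size by the same factor at each step, capped
    at max_batch; after the cap is hit, decay the learning rate instead."""
    steps = sum(epoch >= b for b in BOUNDARIES)
    batch = base_batch * DECAY_FACTOR ** steps
    if batch <= max_batch:
        return base_lr, batch
    # Residual learning-rate decay once the batch size has saturated.
    overflow = batch / max_batch
    return base_lr / overflow, max_batch

if __name__ == "__main__":
    for epoch in (0, 60, 120, 160):
        print(epoch, lr_decay_schedule(epoch),
              batch_size_increase_schedule(epoch))
```

Run standalone, this prints matching ratios of learning rate to batch size for the two schedules at each boundary, which is the equivalence the paper demonstrates empirically in Section 5.1.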