On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Authors: Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions and, as is well known, sharp minima lead to poorer generalization." (A sharpness-metric sketch follows the table.)
Researcher Affiliation | Collaboration | Nitish Shirish Keskar, Northwestern University, Evanston, IL 60208 (keskar.nitish@northwestern.edu); Dheevatsa Mudigere, Intel Corporation, Bangalore, India (dheevatsa.mudigere@intel.com); Jorge Nocedal, Northwestern University, Evanston, IL 60208 (j-nocedal@northwestern.edu); Mikhail Smelyanskiy, Intel Corporation, Santa Clara, CA 95054 (mikhail.smelyanskiy@intel.com); Ping Tak Peter Tang, Intel Corporation, Santa Clara, CA 95054 (peter.tang@intel.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "The code to reproduce the parametric plot on exemplary networks can be found in our GitHub repository: https://github.com/keskarnitish/large-batch-training." (A minimal parametric-plot sketch appears after the table.)
Open Datasets | Yes | Table 5 (Data Sets) lists MNIST (LeCun et al., 1998a;b), TIMIT (Garofolo et al., 1993), CIFAR-10 (Krizhevsky & Hinton, 2009), and CIFAR-100 (Krizhevsky & Hinton, 2009).
Dataset Splits | No | The paper does not explicitly state the use of a validation split; Table 5 provides only 'Train' and 'Test' data points. Although early stopping is mentioned, there is no clear description of a separate validation set.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper mentions software components and optimizers such as ADAM, ADAGRAD, SGD, adaQN, and Kaldi, but it does not specify version numbers for these components.
Experiment Setup | Yes | For all experiments, we used 10% of the training data as batch size for the large-batch experiments and 256 data points for small-batch experiments. We used the ADAM optimizer for both regimes. All experiments were conducted 5 times from different (uniformly distributed random) starting points, and we report both the mean and standard deviation of measured quantities. The networks were trained, without any budget or limits, until the loss function ceased to improve. Appendix B details the network architectures, including layers, neurons, activation functions (ReLU, softmax), batch normalization, and dropout (0.5 retention probability). (See the configuration sketch after the table.)
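
Below is a minimal sketch of the setup the table describes: train the same network in the small-batch (256) and large-batch (10% of the training data) regimes with ADAM, then trace the paper's parametric plot by evaluating the loss along the line segment between the two solutions. This is not the authors' released code; it assumes TensorFlow/Keras is available, and the MNIST fully connected network, epoch count, and interpolation granularity are illustrative choices rather than the paper's exact configurations.

```python
# Sketch of the SB vs. LB comparison and the parametric plot: train two copies of the
# same network with ADAM, one with a 256-sample batch (SB) and one with a batch of 10%
# of the training data (LB), then evaluate the loss along the line
#     x(alpha) = alpha * x_LB + (1 - alpha) * x_SB,  alpha in [-1, 2].
import numpy as np
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

def build_model():
    # Small fully connected network standing in for the paper's exemplary networks.
    return keras.Sequential([
        keras.Input(shape=(784,)),
        keras.layers.Dense(512, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

def compile_model(model):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train(batch_size, epochs=5):
    model = compile_model(build_model())
    model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=0)
    return model.get_weights()

sb_weights = train(batch_size=256)                 # small-batch regime (256, as in the paper)
lb_weights = train(batch_size=len(x_train) // 10)  # large-batch regime (10% of training data)

eval_model = compile_model(build_model())

# Sweep alpha over [-1, 2] as in the paper's parametric plots.
for alpha in np.linspace(-1.0, 2.0, 25):
    eval_model.set_weights([alpha * lb + (1.0 - alpha) * sb
                            for sb, lb in zip(sb_weights, lb_weights)])
    train_loss, train_acc = eval_model.evaluate(x_train, y_train, verbose=0)
    test_loss, test_acc = eval_model.evaluate(x_test, y_test, verbose=0)
    print(f"alpha={alpha:+.2f}  train_loss={train_loss:.4f}  train_acc={train_acc:.4f}  "
          f"test_loss={test_loss:.4f}  test_acc={test_acc:.4f}")
```

In the paper's figures of this kind, the large-batch endpoint (alpha = 1) sits in a visibly sharper basin than the small-batch endpoint (alpha = 0).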
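
The sharpness claim in the Research Type row is quantified in the paper by a metric that measures the largest relative increase of the loss within a small box around the solution. The sketch below replaces the paper's constrained L-BFGS-B maximization with random sampling inside the box, so it only yields a loose lower bound; the function name `sharpness`, the `loss_fn` interface, and the toy quadratic usage are assumptions made for illustration.

```python
# Rough stand-in for the paper's sharpness metric: the largest relative increase of the
# loss f over the box C_eps = { y : |y_i| <= eps * (|x_i| + 1) } around the solution x,
#     sharpness(x) = 100 * (max_{y in C_eps} f(x + y) - f(x)) / (1 + f(x)).
# The paper solves this maximization with L-BFGS-B (optionally in a random subspace);
# random sampling is used here purely as a cheap approximation.
import numpy as np

def sharpness(loss_fn, x, eps=1e-3, n_samples=200, seed=0):
    """loss_fn maps a flat parameter vector to the training loss; x is the flat solution."""
    rng = np.random.default_rng(seed)
    f_x = loss_fn(x)
    half_width = eps * (np.abs(x) + 1.0)            # per-coordinate half-width of the box
    best_increase = 0.0
    for _ in range(n_samples):
        y = rng.uniform(-half_width, half_width)    # random perturbation inside the box
        best_increase = max(best_increase, loss_fn(x + y) - f_x)
    return 100.0 * best_increase / (1.0 + f_x)

# Toy usage on quadratic "losses": one flat minimum and one sharp minimum at the origin.
if __name__ == "__main__":
    flat_loss = lambda w: 0.5 * float(w @ np.diag([1.0, 1.0]) @ w)
    sharp_loss = lambda w: 0.5 * float(w @ np.diag([1.0, 500.0]) @ w)
    x_star = np.zeros(2)
    print("flat :", sharpness(flat_loss, x_star, eps=1e-2))
    print("sharp:", sharpness(sharp_loss, x_star, eps=1e-2))
```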