On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Authors: Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate the cause for this generalization drop in the large-batch regime and present numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization. |
| Researcher Affiliation | Collaboration | Nitish Shirish Keskar (Northwestern University, Evanston, IL 60208; keskar.nitish@northwestern.edu); Dheevatsa Mudigere (Intel Corporation, Bangalore, India; dheevatsa.mudigere@intel.com); Jorge Nocedal (Northwestern University, Evanston, IL 60208; j-nocedal@northwestern.edu); Mikhail Smelyanskiy (Intel Corporation, Santa Clara, CA 95054; mikhail.smelyanskiy@intel.com); Ping Tak Peter Tang (Intel Corporation, Santa Clara, CA 95054; peter.tang@intel.com) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to reproduce the parametric plot on exemplary networks can be found in our GitHub repository: https://github.com/keskarnitish/large-batch-training. A hedged sketch of that parametric plot is included after the table. |
| Open Datasets | Yes | Table 5 (Data Sets) lists MNIST (LeCun et al., 1998a;b), TIMIT (Garofolo et al., 1993), CIFAR-10 (Krizhevsky & Hinton, 2009), and CIFAR-100 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper does not explicitly state the use of a validation split. Table 5 provides only 'Train' and 'Test' sizes; although early stopping is mentioned, no separate validation set is described. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions software components and optimizers such as ADAM, ADAGRAD, SGD, adaQN, and Kaldi, but it does not specify version numbers for any of them. |
| Experiment Setup | Yes | For all experiments, we used 10% of the training data as batch size for the large-batch experiments and 256 data points for small-batch experiments. We used the ADAM optimizer for both regimes. All experiments were conducted 5 times from different (uniformly distributed random) starting points, and we report both the mean and standard deviation of the measured quantities. The networks were trained, without any budget or limits, until the loss function ceased to improve. Appendix B details the network architectures, including layers, neurons, activation functions (ReLU, Softmax), batch normalization, and dropout (0.5 retention probability). A hedged configuration sketch of these two regimes appears at the end of this page. |
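
The parametric plot mentioned in the Open Source Code row evaluates the loss along the line segment between a small-batch (SB) solution and a large-batch (LB) solution, i.e. f(alpha * x_LB + (1 - alpha) * x_SB) with alpha swept over an interval such as [-1, 2]. The following is a minimal sketch of that interpolation, assuming a PyTorch setup; `model`, `loss_fn`, `data_loader`, `sb_weights`, and `lb_weights` are hypothetical placeholders, not names from the authors' repository.

```python
# Minimal sketch (assuming a PyTorch setup) of the parametric plot described in
# the paper: evaluate the loss along the line between a small-batch (SB) and a
# large-batch (LB) solution. `model`, `loss_fn`, `data_loader`, and the state
# dicts `sb_weights` / `lb_weights` are hypothetical placeholders, not names
# taken from the authors' repository.
import copy
import torch

def loss_along_segment(model, loss_fn, data_loader, sb_weights, lb_weights, alphas):
    """Mean loss at theta(alpha) = alpha * x_LB + (1 - alpha) * x_SB."""
    curve = []
    for alpha in alphas:
        interp = copy.deepcopy(model)
        state = {}
        for name, sb_param in sb_weights.items():
            lb_param = lb_weights[name]
            if torch.is_floating_point(sb_param):
                state[name] = alpha * lb_param + (1.0 - alpha) * sb_param
            else:
                state[name] = sb_param  # integer buffers (e.g. counters) are copied as-is
        interp.load_state_dict(state)
        interp.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for inputs, targets in data_loader:
                total += loss_fn(interp(inputs), targets).item() * inputs.size(0)
                count += inputs.size(0)
        curve.append(total / count)
    return curve

# Hypothetical usage: sweep alpha over [-1, 2] as in the paper's parametric plots.
# alphas = [i / 20.0 - 1.0 for i in range(61)]
# curve = loss_along_segment(model, torch.nn.CrossEntropyLoss(), test_loader,
#                            sb_state, lb_state, alphas)
```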
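
The Experiment Setup row describes two regimes: a large-batch run using 10% of the training data per batch and a small-batch run using 256 data points, both optimized with ADAM and repeated 5 times from random starting points until the loss ceases to improve. The sketch below encodes that configuration under the same PyTorch assumption; `build_network` and `train_until_plateau` are hypothetical helpers, not the authors' code.

```python
# Configuration sketch of the two batch-size regimes (assuming a PyTorch setup).
# `build_network` and `train_until_plateau` are hypothetical helpers, not the
# authors' code; only the batch sizes, optimizer, and repetition count follow
# the paper's description.
import torch
from torch.utils.data import DataLoader

def run_regime(train_set, large_batch=False, num_trials=5):
    # Large batch: 10% of the training set; small batch: 256 data points.
    batch_size = int(0.1 * len(train_set)) if large_batch else 256
    results = []
    for trial in range(num_trials):  # 5 runs from different random starting points
        torch.manual_seed(trial)
        model = build_network()                            # hypothetical architecture factory
        loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
        optimizer = torch.optim.Adam(model.parameters())   # ADAM in both regimes
        # Train with no fixed budget, until the loss function ceases to improve.
        results.append(train_until_plateau(model, loader, optimizer))
    return results
```

Because the large-batch size is defined as 10% of the training data, it scales with the dataset (e.g. 5,000 for CIFAR-10's 50,000 training images), while the small-batch size stays fixed at 256.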