The Break-Even Point on Optimization Trajectories of Deep Neural Networks
Authors: Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho*, Krzysztof Geras*
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main contribution is to state and present empirical evidence for two conjectures about the dependence of the entire optimization trajectory on the early phase of training. Specifically, we conjecture that the hyperparameters of stochastic gradient descent (SGD) used before reaching the break-even point control: (1) the spectral norms of K and H, and (2) the conditioning of K and H. In particular, using a larger learning rate prior to reaching the break-even point reduces the spectral norm of K along the optimization trajectory (see Fig. 1 for an illustration of this phenomenon). Reducing the spectral norm of K decreases the variance of the mini-batch gradient, which has been linked to improved convergence speed (Johnson & Zhang, 2013). (A sketch of how the spectral norm of K can be estimated appears below the table.) |
| Researcher Affiliation | Collaboration | Stanisław Jastrzębski (1), Maciej Szymczak (2), Stanislav Fort (3), Devansh Arpit (4), Jacek Tabor (2), Kyunghyun Cho (1,5,6), Krzysztof Geras (1); (1) New York University, USA; (2) Jagiellonian University, Poland; (3) Stanford University, USA; (4) Salesforce Research, USA; (5) Facebook AI Research, USA; (6) CIFAR Azrieli Global Scholar |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the methodology described, nor does it provide a direct link to such code. It does link to a BERT model's pretrained weights, but that is not the authors' own source code for their method. |
| Open Datasets | Yes | We run experiments on the following datasets: CIFAR-10 (Krizhevsky, 2009), IMDB dataset (Maas et al., 2011), ImageNet (Deng et al., 2009), and MNLI (Williams et al., 2018). |
| Dataset Splits | Yes | The training and validation accuracies are reported for all the experiments in App. E. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for its experiments (e.g., GPU models, CPU types, or memory specifications). It only mentions a 'reasonable computational budget'. |
| Software Dependencies | No | The paper mentions the 'SciPy Python package' and the 'NumPy package' without specifying version numbers. It also refers to a 'Keras example' without a version for Keras. No specific version numbers for key software components are provided. |
| Experiment Setup | Yes | We run experiments on the following datasets: CIFAR-10 (Krizhevsky, 2009), IMDB dataset (Maas et al., 2011), ImageNet (Deng et al., 2009), and MNLI (Williams et al., 2018). We apply to these datasets the following architectures: a vanilla CNN (SimpleCNN) following Keras example (Chollet et al., 2015), ResNet-32 (He et al., 2015), LSTM (Hochreiter & Schmidhuber, 1997), DenseNet (Huang et al., 2016), and BERT (Devlin et al., 2018). ... ResNet-32 (He et al., 2015) is trained for 200 epochs with a batch size equal to 128 on the CIFAR-10 dataset. Standard data augmentation and preprocessing is applied. Following He et al. (2015), we regularize the model using weight decay 0.0001. We apply weight decay to all convolutional kernels. When varying the batch size, we use learning rate of 0.05. When varying the learning rate, we use batch size of 128. ... The model is trained for 20 epochs using a batch size of 32. ... The model is trained for 100 epochs. When varying the learning rate, we use batch size of 128. When varying the batch size, we use learning rate of 1.0. ... The network is trained only for 10 epochs using a batch size of 32. (A hedged sketch of the ResNet-32/CIFAR-10 configuration follows the table.) |
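
The Research Type row quotes the paper's conjecture that the early-phase learning rate controls the spectral norm of K, the covariance of the mini-batch gradients, and of H, the Hessian of the training loss. The sketch below is not the authors' measurement code; it only illustrates one way such a quantity could be estimated. The names `model`, `loss_fn`, and `batches` (a Keras model, a Keras loss, and an iterator of `(x, y)` mini-batches) are assumptions, and the paper's exact estimator of K may differ (for example, per-example rather than per-mini-batch gradients, or an uncentered covariance).

```python
# Minimal sketch (assumptions as noted above): estimate the spectral norm of K,
# the covariance of mini-batch gradients, at a fixed point on the trajectory.
import numpy as np
import tensorflow as tf

def flat_gradient(model, loss_fn, x, y):
    """Mini-batch gradient of the loss, flattened into a single vector."""
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    return np.concatenate([g.numpy().ravel() for g in grads])

def spectral_norm_of_K(model, loss_fn, batches, n_batches=25):
    """Estimate ||K||_2 from a handful of mini-batch gradients."""
    gs = np.stack([flat_gradient(model, loss_fn, x, y)
                   for (x, y), _ in zip(batches, range(n_batches))])
    centered = gs - gs.mean(axis=0, keepdims=True)
    # The nonzero eigenvalues of the (huge) parameter-space covariance equal
    # those of this small n_batches x n_batches Gram matrix, so K is never
    # formed explicitly.
    gram = centered @ centered.T / len(gs)
    return float(np.linalg.eigvalsh(gram)[-1])
```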
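
For the Experiment Setup row, the following is a minimal sketch of the quoted ResNet-32 / CIFAR-10 configuration (200 epochs, batch size 128, SGD, weight decay 0.0001 on convolutional kernels, standard augmentation), written against the Keras API the paper builds on. It is not the authors' released code; `build_resnet32` is a hypothetical placeholder rather than the architecture of He et al. (2015), and the augmentation choices are a common convention, not a detail stated in the paper.

```python
# Hedged sketch of the quoted ResNet-32 / CIFAR-10 training configuration.
import tensorflow as tf

def build_resnet32(weight_decay=1e-4):
    """Placeholder standing in for ResNet-32 (He et al., 2015); any tf.keras.Model
    with l2(weight_decay) on its convolutional kernels fits this slot."""
    reg = tf.keras.regularizers.l2(weight_decay)
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                               kernel_regularizer=reg, input_shape=(32, 32, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# CIFAR-10 with simple preprocessing (pixel values scaled to [0, 1]).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = build_resnet32(weight_decay=1e-4)  # weight decay 0.0001 from the paper
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.05),  # lr quoted when varying batch size
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# "Standard data augmentation": shifts and horizontal flips are one common choice.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True
)

model.fit(
    datagen.flow(x_train, y_train, batch_size=128),  # batch size from the paper
    epochs=200,                                      # epochs from the paper
    validation_data=(x_test, y_test),
)
```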