Momentum Improves Normalized SGD
Authors: Ashok Cutkosky, Harsh Mehta
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks. From Section 5 (Experiments): Now, we turn to experimental evaluation of the proposed method NIGT on two popular large-scale deep learning benchmarks: BERT pretraining and ResNet-50. |
| Researcher Affiliation | Collaboration | Google Research, California, USA and Boston University, Massachusetts, USA. Correspondence to: Ashok Cutkosky <ashok@cutkosky.com>, Harsh Mehta <harshm@google.com>. |
| Pseudocode | Yes | Algorithm 1: Normalized SGD with Implicit Gradient Transport (NIGT, pronounced "night"). See the hedged update sketch after the table. |
| Open Source Code | No | The paper states 'We implemented our algorithm in the Tensorflow framework' but does not provide any link, repository, or explicit statement about making the source code publicly available for the described methodology. |
| Open Datasets | Yes | We train on the ImageNet dataset (Deng et al., 2009) |
| Dataset Splits | No | The paper mentions 'Masked language modeling validation accuracy' for BERT pretraining and 'Top-1 validation accuracy' for ResNet-50 on ImageNet, indicating the use of a validation set. However, it does not provide specific details on the split percentages, sample counts, or citations to predefined validation splits. |
| Hardware Specification | Yes | All our experiments were conducted on a TPUv3 architecture. |
| Software Dependencies | No | The paper states 'We implemented our algorithm in the Tensorflow framework' but does not provide specific version numbers for TensorFlow or any other software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | For simplicity, we implemented a per-layer version of our algorithm, normalizing the gradients for each layer in the network, rather than normalizing the full gradient. Taking our cue from defaults in previous empirical literature on momentum, we set the β parameter to 0.9 for NIGT for both BERT and ResNet-50. For BERT, we stick with the learning rate schedule used for Adam in (Devlin et al., 2019), i.e. linear warmup and polynomial decay ηt = η0(1 − t/T). For ResNet-50, we found that linear warmup and polynomial decay ηt = η0(1 − t/T)² worked best (You et al., 2017). We performed a grid search over the base learning rate η0 ∈ [10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰] for both tasks. In our implementation, we also scale the learning rate with the norm of the weights for each layer, similar to (You et al., 2017). We did not normalize gradients for bias, batch normalization, and layer normalization parameters, and we scaled their learning rates by a factor of 1000. See the learning-rate schedule sketch after the table. |
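
The Pseudocode row refers to Algorithm 1 (NIGT). The following is a minimal sketch of a NIGT-style update, assembled from the algorithm's name (normalized SGD, momentum, implicit gradient transport) and the per-layer normalization described in the Experiment Setup row; it is not guaranteed to match the paper's exact Algorithm 1, and names such as `nigt_step` and `grad_fn` are placeholders introduced here for illustration.

```python
import numpy as np

def nigt_step(x, x_prev, m, grad_fn, lr, beta=0.9, eps=1e-8):
    """One hedged NIGT-style step (sketch, not the paper's exact Algorithm 1)."""
    # Implicit gradient transport: query the gradient at an extrapolated point
    # rather than at the current iterate.
    y = x + (beta / (1.0 - beta)) * (x - x_prev)
    g = grad_fn(y)

    # Exponential momentum average of the transported gradients.
    m = beta * m + (1.0 - beta) * g

    # Normalized SGD update: step along the unit direction of the momentum.
    x_new = x - lr * m / (np.linalg.norm(m) + eps)
    return x_new, x, m

# Toy usage on a quadratic objective, purely illustrative.
grad_fn = lambda x: 2.0 * x
x, x_prev, m = np.ones(3), np.ones(3), np.zeros(3)
for t in range(100):
    x, x_prev, m = nigt_step(x, x_prev, m, grad_fn, lr=0.05)
```

The paper's reported implementation applies the normalization per layer and scales each layer's learning rate by its weight norm (as in You et al., 2017); the sketch above normalizes a single parameter vector for brevity.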
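
The learning-rate schedules quoted in the Experiment Setup row (linear warmup followed by polynomial decay, power 1 for BERT and power 2 for ResNet-50) can be sketched as below. The warmup length and total step count are illustrative assumptions, not values reported in the paper; only the decay form and the η0 grid come from the quoted text.

```python
def lr_schedule(step, total_steps, base_lr, warmup_steps, power):
    """Linear warmup, then polynomial decay eta_t = eta_0 * (1 - t/T)**power.

    power=1 corresponds to the BERT schedule and power=2 to the ResNet-50
    schedule described in the Experiment Setup row (sketch only).
    """
    if step < warmup_steps:
        # Linear warmup from 0 to the base learning rate.
        return base_lr * step / max(1, warmup_steps)
    # Polynomial decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power

# Grid of base learning rates searched in the paper: 1e-5 through 1e0.
base_lrs = [10.0 ** k for k in range(-5, 1)]
```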