Momentum Improves Normalized SGD
Authors: Ashok Cutkosky, Harsh Mehta
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks. From Section 5 (Experiments): Now, we turn to experimental evaluation of the proposed method NIGT on two popular large-scale deep learning benchmarks: BERT pretraining and ResNet-50. |
| Researcher Affiliation | Collaboration | Google Research, California, USA and Boston University, Massachusetts, USA. Correspondence to: Ashok Cutkosky <ashok@cutkosky.com>, Harsh Mehta <harshm@google.com>. |
| Pseudocode | Yes | Algorithm 1: Normalized SGD with Implicit Gradient Transport (NIGT, pronounced "night"). See the hedged update sketch after the table. |
| Open Source Code | No | The paper states 'We implemented our algorithm in the Tensorflow framework' but does not provide any link, repository, or explicit statement about making the source code publicly available for the described methodology. |
| Open Datasets | Yes | We train on the ImageNet dataset (Deng et al., 2009) |
| Dataset Splits | No | The paper mentions 'Masked language modeling validation accuracy' for BERT pretraining and 'Top-1 validation accuracy' for ResNet-50 on ImageNet, indicating the use of a validation set. However, it does not provide specific details on the split percentages, sample counts, or citations to predefined validation splits. |
| Hardware Specification | Yes | All our experiments were conducted on a TPUv3 architecture. |
| Software Dependencies | No | The paper states 'We implemented our algorithm in the Tensorflow framework' but does not provide specific version numbers for TensorFlow or any other software dependencies, which are required for reproducibility. |
| Experiment Setup | Yes | For simplicity, we implemented a per-layer version of our algorithm, normalizing the gradients for each layer in the network, rather than normalizing the full gradient. Taking our cue from defaults in previous empirical literature on momentum, we set the β parameter to 0.9 for NIGT for both BERT and ResNet-50. For BERT, we stick with the learning rate schedule used for Adam in (Devlin et al., 2019), i.e. linear warmup and polynomial decay ηt = η0(1 − t/T). For ResNet-50, we found that linear warmup and polynomial decay ηt = η0(1 − t/T)² worked best (You et al., 2017). We performed a grid search over the base learning rate η0 ∈ [10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹, 10⁰] for both tasks. In our implementation, we also scale the learning rate with the norm of the weights for each layer, similar to (You et al., 2017). We did not normalize gradients for bias, batch normalization, and layer normalization parameters, and we scaled their learning rates by a factor of 1000. See the learning-rate schedule sketch after the table. |
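
The Pseudocode row refers to Algorithm 1 (NIGT). The following is a minimal sketch of a NIGT-style update, assembled from the algorithm's name (normalized SGD, momentum, implicit gradient transport) and the per-layer normalization described in the Experiment Setup row; it is not guaranteed to match the paper's exact Algorithm 1, and names such as `nigt_step` and `grad_fn` are placeholders introduced here for illustration.

```python
import numpy as np

def nigt_step(x, x_prev, m, grad_fn, lr, beta=0.9, eps=1e-8):
    """One hedged NIGT-style step (sketch, not the paper's exact Algorithm 1)."""
    # Implicit gradient transport: query the gradient at an extrapolated point
    # rather than at the current iterate.
    y = x + (beta / (1.0 - beta)) * (x - x_prev)
    g = grad_fn(y)

    # Exponential momentum average of the transported gradients.
    m = beta * m + (1.0 - beta) * g

    # Normalized SGD update: step along the unit direction of the momentum.
    x_new = x - lr * m / (np.linalg.norm(m) + eps)
    return x_new, x, m

# Toy usage on a quadratic objective, purely illustrative.
grad_fn = lambda x: 2.0 * x
x, x_prev, m = np.ones(3), np.ones(3), np.zeros(3)
for t in range(100):
    x, x_prev, m = nigt_step(x, x_prev, m, grad_fn, lr=0.05)
```

The paper's reported implementation applies the normalization per layer and scales each layer's learning rate by its weight norm (as in You et al., 2017); the sketch above normalizes a single parameter vector for brevity.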
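
The learning-rate schedules quoted in the Experiment Setup row (linear warmup followed by polynomial decay, power 1 for BERT and power 2 for ResNet-50) can be sketched as below. The warmup length and total step count are illustrative assumptions, not values reported in the paper; only the decay form and the η0 grid come from the quoted text.

```python
def lr_schedule(step, total_steps, base_lr, warmup_steps, power):
    """Linear warmup, then polynomial decay eta_t = eta_0 * (1 - t/T)**power.

    power=1 corresponds to the BERT schedule and power=2 to the ResNet-50
    schedule described in the Experiment Setup row (sketch only).
    """
    if step < warmup_steps:
        # Linear warmup from 0 to the base learning rate.
        return base_lr * step / max(1, warmup_steps)
    # Polynomial decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power

# Grid of base learning rates searched in the paper: 1e-5 through 1e0.
base_lrs = [10.0 ** k for k in range(-5, 1)]
```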