Better SGD using Second-order Momentum

Authors: Hoang Tran, Ashok Cutkosky

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our algorithm not only enjoys optimal theoretical properties, it is also practically effective, as demonstrated through our experimental results across various deep learning tasks.
Researcher Affiliation | Academia | Hoang Tran (Boston University, tranhp@bu.edu); Ashok Cutkosky (Boston University, ashok@cutkosky.com)
Pseudocode | Yes | Algorithm 1: SGD with Hessian-corrected Momentum (SGDHess); see the hedged sketch after the table.
Open Source Code | Yes | The link to the code is provided in the appendix.
Open Datasets | Yes | Our Cifar10 experiment is conducted using the official implementation of AdaHessian. ... We also train SGD, SGDHess, and AdaHessian with ImageNet Deng et al. [2009] on ResNet-18... We use the IWSLT14 German-to-English dataset that contains 153k/7k/7k examples in the train/validation/test sets.
Dataset Splits | Yes | We use the IWSLT14 German-to-English dataset that contains 153k/7k/7k examples in the train/validation/test sets.
Hardware Specification | Yes | All experiments are run on NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions 'Pytorch' but does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | For the rest of the optimizers, we performed a grid search on the base learning rate η ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} to find the best settings. Similar to the Cifar10 experiment of AdaHessian, we also trained our models for 160 epochs, ran each optimizer 5 times, and reported the average best accuracy as well as the standard deviation (detailed results in the appendix). We use standard parameter values for SGD (lr = 0.1, momentum = 0.9, weight_decay = 1e-4) for both SGD and SGDHess, and the recommended parameter values for AdaHessian. For the learning rate scheduler, we employ the plateau decay scheduler used in Yao et al. [2020]. We train our model for the standard 90 epochs. A configuration sketch of these settings also follows the table.
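The Pseudocode row above cites Algorithm 1 (SGDHess). Below is a minimal PyTorch-style sketch of what a Hessian-corrected momentum step typically looks like, using a double-backprop Hessian-vector product along the previous parameter displacement. The helper names grads_and_hvp and sgdhess_step, the placement of weight decay, and the omission of any clipping or normalization the paper may use are assumptions, not the authors' released implementation; only the idea of correcting the momentum buffer with a Hessian-vector product is taken from the algorithm's name.

    import torch

    def grads_and_hvp(loss, params, vec):
        # Gradients of `loss` w.r.t. `params`, plus the Hessian-vector product
        # H(x_t) @ vec computed by double backprop (Pearlmutter's trick).
        grads = torch.autograd.grad(loss, params, create_graph=True)
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        hv = torch.autograd.grad(dot, params)
        return [g.detach() for g in grads], hv

    @torch.no_grad()
    def sgdhess_step(params, grads, hv, momentum, lr=0.1, beta=0.9, weight_decay=1e-4):
        # One Hessian-corrected momentum update. Returns the step x_{t+1} - x_t,
        # which becomes `vec` for the next call to grads_and_hvp.
        steps = []
        for p, g, h, m in zip(params, grads, hv, momentum):
            g = g + weight_decay * p             # plain L2 weight decay, as in torch.optim.SGD
            m.mul_(beta).add_(h, alpha=beta)     # beta * (m_{t-1} + H_t (x_t - x_{t-1}))
            m.add_(g, alpha=1.0 - beta)          # + (1 - beta) * g_t
            step = m.mul(-lr)                    # x_{t+1} - x_t
            p.add_(step)
            steps.append(step)
        return steps

    # Usage inside a training loop (model, criterion, and data loading omitted):
    #   params   = [p for p in model.parameters() if p.requires_grad]
    #   momentum = [torch.zeros_like(p) for p in params]
    #   steps    = [torch.zeros_like(p) for p in params]   # x_1 - x_0 := 0
    #   loss = criterion(model(x), y)
    #   grads, hv = grads_and_hvp(loss, params, steps)
    #   steps = sgdhess_step(params, grads, hv, momentum)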
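For the Experiment Setup row, the quoted hyperparameters can be collected into a small configuration sketch. The numeric values below are transcribed from that row; the helper make_plateau_scheduler and the ReduceLROnPlateau factor/patience settings are placeholders, since the section only says the plateau decay scheduler of Yao et al. [2020] is used.

    import torch

    BASE_LR_GRID = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]    # grid-searched base learning rates
    CIFAR10_EPOCHS = 160                                         # 160 epochs, 5 runs per optimizer
    IMAGENET_EPOCHS = 90                                         # the standard 90-epoch schedule
    SGD_CONFIG = dict(lr=0.1, momentum=0.9, weight_decay=1e-4)   # shared by SGD and SGDHess

    def make_plateau_scheduler(optimizer):
        # Plateau decay in the spirit of Yao et al. [2020]; the factor and patience
        # values here are guesses, not taken from the paper.
        return torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer, mode="max", factor=0.1, patience=10)

    # Example with the SGD baseline configuration:
    #   optimizer = torch.optim.SGD(model.parameters(), **SGD_CONFIG)
    #   scheduler = make_plateau_scheduler(optimizer)
    #   scheduler.step(val_accuracy)   # called once per epoch with the tracked metric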