Better SGD using Second-order Momentum
Authors: Hoang Tran, Ashok Cutkosky
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithm not only enjoys optimal theoretical properties, it is also practically effective, as demonstrated through our experimental results across various deep learning tasks. |
| Researcher Affiliation | Academia | Hoang Tran Boston University tranhp@bu.edu Ashok Cutkosky Boston University ashok@cutkosky.com |
| Pseudocode | Yes | Algorithm 1 SGD with Hessian-corrected Momentum (SGDHess) (a hedged sketch of this update appears after the table) |
| Open Source Code | Yes | The link to the code is provided in the appendix. |
| Open Datasets | Yes | Our Cifar10 experiment is conducted using the official implementation of Ada Hessian. ... We also train SGD, SGDHess, and Ada Hessian with Imagenet Deng et al. [2009] on Resnet18... We use the IWSLT 14 German to English dataset that contains 153k/7k/7k in the train/validation/test set. |
| Dataset Splits | Yes | We use the IWSLT 14 German to English dataset that contains 153k/7k/7k in the train/validation/test set. |
| Hardware Specification | Yes | All experiments are run on NVIDIA v100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | For the rest of the optimizers, we performed a grid search on the base learning rate η ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} to find the best settings. Similar to the Cifar10 experiment of Ada Hessian, we also trained our models for 160 epochs, ran each optimizer 5 times, and reported the average best accuracy as well as the standard deviation (detailed results in the appendix). We use standard parameter values for SGD (lr = 0.1, momentum = 0.9, weight_decay = 1e-4) for both SGD and SGDHess, and the recommended parameter values for Ada Hessian. For the learning rate scheduler, we employ the plateau decay scheduler used in Yao et al. [2020]. We train our model for the usual 90 epochs. (A hedged sketch of this setup also appears after the table.) |
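The pseudocode row above names Algorithm 1, SGD with Hessian-corrected Momentum (SGDHess). The following is a minimal PyTorch sketch of that kind of update, assuming the momentum buffer is corrected with a Hessian-vector product H(x_t)(x_t − x_{t−1}); the function name `sgd_hess_step`, the `state` dictionary, and the default hyperparameters are illustrative choices, not the authors' implementation (their official code is linked in the paper's appendix).

```python
import torch

def sgd_hess_step(params, loss_fn, state, lr=0.1, beta=0.9, weight_decay=1e-4):
    """One step of SGD with Hessian-corrected momentum (illustrative sketch).

    Assumed update rule:
        m <- beta * (m + H(x_t) @ (x_t - x_{t-1})) + (1 - beta) * g(x_t)
        x <- x - lr * m
    where the Hessian-vector product uses the displacement from the previous
    step, so the momentum buffer tracks the gradient at the current iterate.
    """
    loss = loss_fn()
    # First-order gradients with a graph, so a Hessian-vector product can follow.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    disps = state.setdefault("disp", [torch.zeros_like(p) for p in params])
    moms = state.setdefault("momentum", [torch.zeros_like(p) for p in params])

    # Hessian-vector product H(x_t) @ (x_t - x_{t-1}) via double backward.
    hvps = torch.autograd.grad(grads, params, grad_outputs=disps)

    with torch.no_grad():
        for p, g, m, h, d in zip(params, grads, moms, hvps, disps):
            g = g + weight_decay * p                     # standard L2 weight decay
            m.mul_(beta).add_(h, alpha=beta).add_(g, alpha=1 - beta)
            update = -lr * m
            d.copy_(update)                              # x_{t+1} - x_t, reused next step
            p.add_(update)
    return loss

# Hypothetical usage on a tiny quadratic problem:
x = torch.randn(5, requires_grad=True)
state = {}
for _ in range(100):
    sgd_hess_step([x], lambda: (x ** 2).sum(), state, lr=0.1)
```

On the first call the stored displacement is zero, so the correction term vanishes and the step reduces to plain SGD with momentum.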
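The experiment-setup row quotes standard SGD settings (lr = 0.1, momentum = 0.9, weight_decay = 1e-4), a learning-rate grid for the other optimizers, and a plateau decay scheduler over 90 ImageNet epochs. The sketch below reconstructs that configuration with stock PyTorch components; the `nn.Linear` stand-in model, the `ReduceLROnPlateau` parameters, and the placeholder accuracy are assumptions, since the exact scheduler settings from Yao et al. [2020] are not quoted here.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)   # stand-in; the paper trains ResNet-18 on ImageNet

# Learning-rate grid reported for the non-SGD optimizers; a full sweep would
# rebuild the optimizer once per value in this list.
lr_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]

# Standard SGD settings quoted above, used for both SGD and SGDHess.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# "Plateau decay" scheduler: ReduceLROnPlateau is the stock PyTorch version and
# may differ in detail from the scheduler used in Yao et al. [2020].
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max",
                                                 factor=0.1, patience=10)

for epoch in range(90):           # ImageNet schedule: 90 epochs
    # ... one training pass over the data would go here ...
    val_accuracy = 0.0            # placeholder for the measured validation accuracy
    scheduler.step(val_accuracy)  # decay the learning rate when accuracy plateaus
```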