Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
# Better SGD using Second-order Momentum

**Authors:** Hoang Tran, Ashok Cutkosky

NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithm not only enjoys optimal theoretical properties, it is also practically effective, as demonstrated through our experimental results across various deep learning tasks. |
| Researcher Affiliation | Academia | Hoang Tran Boston University EMAIL Ashok Cutkosky Boston University EMAIL |
| Pseudocode | Yes | Algorithm 1 SGD with Hessian-corrected Momentum (SGDHess) |
| Open Source Code | Yes | The link to the code is provided in the appendix. |
| Open Datasets | Yes | Our Cifar10 experiment is conducted using the official implementation of Ada Hessian. ... We also train SGD, SGDHess, and Ada Hessian with Imagenet Deng et al. [2009] on Resnet18... We use the IWSLT 14 German to English dataset that contains 153k/7k/7k in the train/validation/test set. |
| Dataset Splits | Yes | We use the IWSLT 14 German to English dataset that contains 153k/7k/7k in the train/validation/test set. |
| Hardware Specification | Yes | All experiments are run on NVIDIA v100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | For the rest of the optimizers, we performed a grid search on the base learning rate η {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1} to find the best settings. Similar to the Cifar10 experiment of Ada Hessian, we also trained our models on 160 epochs and we ran each optimizer 5 times and reported the average best accuracy as well as the standard deviation (detailed results in the appendix). We use standard parameter values for SGD (lr = 0.1, momentum = 0.9, weight_decay = 1e-4) for both SGD and SGDHess and the recommended parameters values for Ada Hessian. For the learning rate scheduler, we employ the plateau decay scheduler that was used in Yao et al. [2020]. We train our model in 90 epochs as usual. |
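The table notes that the paper provides pseudocode for Algorithm 1, SGD with Hessian-corrected Momentum (SGDHess). As a rough illustration only, the sketch below implements a generic STORM-style Hessian-corrected momentum update on a toy quadratic; the update rule, function names, and hyperparameter values here are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

# Illustrative sketch (NOT the paper's exact Algorithm 1): momentum is
# corrected with a Hessian-vector product that accounts for the parameter
# shift x_t - x_{t-1}, in the style of STORM-type variance reduction.
# Hyperparameters below are illustrative assumptions.
def sgd_hess_sketch(grad, hess_vec, x0, lr=0.1, beta=0.9, steps=200):
    x = x0.copy()
    m = grad(x)  # initialize momentum with the first gradient
    for _ in range(steps):
        x_prev, x = x, x - lr * m
        g = grad(x)
        # Hessian-vector product compensates the stale momentum for the move
        m = beta * (m + hess_vec(x, x - x_prev)) + (1 - beta) * g
    return x

# Toy quadratic f(x) = 0.5 * x^T A x, whose gradient is A x and Hessian is A
A = np.diag([1.0, 10.0])
x_star = sgd_hess_sketch(lambda x: A @ x, lambda x, v: A @ v,
                         np.array([5.0, -3.0]))
```

On this deterministic quadratic the correction makes the momentum track the true gradient exactly, so the iterate converges to the minimizer at the origin.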
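The experiment-setup excerpt describes a grid search over base learning rates η ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}. A minimal sketch of that selection loop, where `train_and_eval` is a hypothetical stand-in for a full training run returning a validation score:

```python
# Sketch of the learning-rate grid search described in the setup: run each
# candidate base lr and keep the one with the best validation score.
# `train_and_eval` is a hypothetical placeholder, not code from the paper.
def grid_search_lr(train_and_eval,
                   grid=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1)):
    results = {lr: train_and_eval(lr) for lr in grid}
    best_lr = max(results, key=results.get)
    return best_lr, results[best_lr]

# Toy stand-in objective whose score peaks at lr = 0.1
best_lr, best_score = grid_search_lr(lambda lr: 1.0 - abs(lr - 0.1))
```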