ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
Authors: Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, Michael Mahoney
AAAI 2021, pp. 10665-10673 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform extensive tests on NLP, CV, and recommendation system tasks, and ADAHESSIAN achieves state-of-the-art results. We extensively test ADAHESSIAN on a wide range of learning tasks. |
| Researcher Affiliation | Academia | Zhewei Yao¹, Amir Gholami¹, Sheng Shen¹, Mustafa Mustafa², Kurt Keutzer¹, Michael Mahoney¹ (¹University of California, Berkeley; ²Lawrence Berkeley National Laboratory) |
| Pseudocode | Yes | Algorithm 1: ADAHESSIAN (a hedged code sketch of this update rule follows the table) |
| Open Source Code | Yes | The code for ADAHESSIAN is open-sourced and publicly available (Yao and Gholami 2020). Yao, Z.; and Gholami, A. 2020. https://github.com/amirgholami/ADAHESSIAN.git. GitHub Online System. |
| Open Datasets | Yes | We perform extensive tests on NLP, CV, and recommendation system tasks, and ADAHESSIAN achieves state-of-the-art results. In particular, we find that ADAHESSIAN: (i) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14, 2.7/1.0 PPL on PTB/Wikitext-103; (ii) outperforms AdamW for SqueezeBERT by 0.41 points on GLUE; (iii) achieves 1.45%/5.55% higher accuracy on ResNet32/ResNet18 on Cifar10/ImageNet as compared to Adam; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. |
| Dataset Splits | Yes | We report the NLU results in Tab. 3, using the SqueezeBERT model (Iandola et al. 2016) tested on GLUE datasets (Wang et al. 2018a). As can be seen, ADAHESSIAN has better performance than AdamW on 5 out of 8 tasks. Particularly, on RTE and MRPC, ADAHESSIAN achieves more than 1 point improvement as compared to AdamW. On average, ADAHESSIAN outperforms AdamW by 0.41 points. Note that, similar to NMT and LM, except for the learning rate and block size, ADAHESSIAN directly uses the same hyperparameters as AdamW. Interestingly, these results are better than those reported in SqueezeBERT (Iandola et al. 2020), even though we only change the optimizer from AdamW to ADAHESSIAN. |
| Hardware Specification | Yes | We have also measured the actual runtime of ADAHESSIAN in PyTorch on a single RTX Titan GPU machine, as reported in the second column of Tab. 6. |
| Software Dependencies | No | The paper mentions "PyTorch" but does not specify a version number for it or any other key software dependencies required for replication. |
| Experiment Setup | Yes | For each task we use the optimal hyperparameters reported in the literature for SGD and Adam W to compare with a strong baseline. However, we perform little tuning on ADAHESSIAN since first we do not have access to industrial scale resources to do extensive tuning, and second we want to show the average performance of ADAHESSIAN instead of the absolute best performance achieved with brute force tuning. As such, we directly use the same β1, β2, weight decay, batch size, dropout rate and learning rate schedule in ADAHESSIAN as in Adam W for each task (even though tuning those is expected to improve ADAHESSIAN performance). For ADAHESSIAN we only tune the learning rate and the spatial averaging block size b. Please see Appendix for more detailed experimental settings. |
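
For readers who want a concrete sense of the update rule summarized in Algorithm 1, the sketch below is a minimal, illustrative PyTorch rendering of one AdaHessian-style step. It is not the authors' released implementation (use the repository linked above for that): the spatial averaging over blocks of size b and the Hessian power k from the paper are omitted, and the function name, state names (`adahessian_step`, `exp_avg`, `exp_hessian_sq`), and default values are illustrative assumptions only. It shows the Hutchinson estimate of the Hessian diagonal via a Hessian-vector product, with Adam-like moment accumulators built on top of it.

```python
# A minimal, illustrative sketch of one AdaHessian-style step in PyTorch.
# NOT the authors' released implementation; block-wise spatial averaging and
# the Hessian power k from Algorithm 1 are omitted, and names/defaults here
# are assumptions for illustration.
import torch


def adahessian_step(loss, params, exp_avg, exp_hessian_sq, step, lr,
                    betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
    """One parameter update using a Hutchinson estimate of the Hessian diagonal."""
    # Gradients built with a graph, so a second differentiation
    # (a Hessian-vector product) is possible.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Rademacher probe vectors: entries are +1 or -1 with equal probability.
    zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]

    # Hessian-vector products: d/dp (g . z) = H z, one per parameter tensor.
    hzs = torch.autograd.grad(grads, params, grad_outputs=zs)

    beta1, beta2 = betas
    bias_c1 = 1.0 - beta1 ** step
    bias_c2 = 1.0 - beta2 ** step

    with torch.no_grad():
        for p, g, z, hz, m, v in zip(params, grads, zs, hzs, exp_avg, exp_hessian_sq):
            # Hutchinson estimate of |diag(H)| for this parameter tensor.
            d = (z * hz).abs()
            # First moment of the gradient, second moment of the Hessian diagonal.
            m.mul_(beta1).add_(g, alpha=1.0 - beta1)
            v.mul_(beta2).add_(d * d, alpha=1.0 - beta2)
            denom = (v / bias_c2).sqrt().add_(eps)
            update = (m / bias_c1) / denom
            if weight_decay != 0.0:
                update = update + weight_decay * p
            p.add_(update, alpha=-lr)


# Hypothetical usage: exp_avg / exp_hessian_sq are zeros_like(p) kept across steps.
# loss = criterion(model(x), y)
# adahessian_step(loss, list(model.parameters()), exp_avg, exp_hessian_sq,
#                 step=t, lr=0.01)
```

The point this sketch tries to make visible is that the Hessian diagonal is never formed explicitly: a single Hessian-vector product against a random Rademacher probe (one extra backward-like pass per iteration) is enough for the Hutchinson estimate, which is why the method can be measured end-to-end on a single GPU as in the paper's Tab. 6 runtime comparison.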