Why Transformers Need Adam: A Hessian Perspective
Authors: Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we provide an explanation through the lens of Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum across parameter blocks varies dramatically, a phenomenon we call "block heterogeneity"; (ii) heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies one single learning rate to all blocks, which cannot handle the heterogeneity among blocks. (A toy quadratic sketch below the table illustrates this single-learning-rate limitation.) |
| Researcher Affiliation | Academia | (1) The Chinese University of Hong Kong, Shenzhen, China; (2) Shenzhen International Center For Industrial And Applied Mathematics, Shenzhen, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1: Stochastic Gradient Descent with Momentum (SGD); Algorithm 2: AdamW; Algorithm 3: Adam with no bias correction; Algorithm 4: The Lanczos Algorithm; Algorithm 5: The Stochastic Lanczos Quadrature Method. (A sketch of the Lanczos/SLQ building block appears below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/zyushun/hessian-spectrum. |
| Open Datasets | Yes | CNNs. We study ResNet18 (11M) and VGG16 (138M) on ImageNet [40, 78]. ... Transformers. We study Transformer with various scales and modalities, including GPT2 (125M) on OpenWebText [71]; ViT-base (86M) on ImageNet [27]; BERT (40M) on Cornell Movie-Dialogs Corpus [25]; GPT2-nano (11M) on English corpus. |
| Dataset Splits | No | The paper mentions various datasets used for training but does not explicitly state the dataset splits (e.g., percentages or sample counts for training, validation, and test sets). It implies the use of default configurations from codebases, but these are not specified within the paper's text. |
| Hardware Specification | Yes | The result is tested on a single V100. ... In total, it takes about 7 days on one A100 GPU to estimate all the blockwise Hessian spectra and the full Hessian spectrum. |
| Software Dependencies | No | The paper mentions using a "simple PyTorch implementation of SLQ" but does not specify any version numbers for PyTorch or other software dependencies. |
| Experiment Setup | Yes | We use batch size = 1024. ... We use batch size = 327,680 tokens. ... We use batch size = 163,840 tokens. ... We use batch size = 245,760 tokens. ... We grid-search the learning rates for SGD and Adam under the same budget and report the best result for each optimizer. We use the cosine-decay learning rate schedule for vision tasks. (A sketch of this tuning protocol appears below the table.) |
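
To make the quoted single-learning-rate argument concrete, here is a minimal NumPy sketch, not the paper's experiment: gradient descent with one shared step size on a block-diagonal quadratic whose blocks have very different curvature, compared with a per-coordinate rescaled update in the spirit of Adam's diagonal preconditioning. The block sizes and curvature values are illustrative assumptions.

```python
# Minimal sketch of block heterogeneity on a diagonal quadratic
# f(x) = 0.5 * x^T H x. Curvatures and block sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
block_curvatures = [100.0, 1.0, 0.01]          # heterogeneous curvature per block
h = np.concatenate([c * np.ones(10) for c in block_curvatures])
x_gd = rng.normal(size=h.size)                 # shared starting point
x_pre = x_gd.copy()

lr = 1.0 / h.max()                             # largest stable shared step size
for _ in range(500):
    x_gd -= lr * (h * x_gd)                    # one learning rate for all blocks
    x_pre -= 0.5 * (h * x_pre) / h             # per-coordinate (Adam-like) rescaling

for name, x in [("shared-lr GD", x_gd), ("per-block rescaled", x_pre)]:
    print(f"{name:>20} final loss: {0.5 * np.sum(h * x ** 2):.2e}")
```

With the shared step size capped by the stiffest block, the low-curvature block barely moves after 500 iterations, while the rescaled update converges on every block; this is the intuition behind the paper's claim that one learning rate cannot serve blocks with very different Hessian spectra.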
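
The blockwise spectra behind these claims are estimated in the paper with the Lanczos and Stochastic Lanczos Quadrature procedures (Algorithms 4 and 5). The PyTorch sketch below shows only the core building blocks, a Hessian-vector product via double backpropagation plus a few Lanczos iterations on a toy model; the model, data, and iteration count are assumptions for illustration, and the released repository above is the reference implementation.

```python
# Hedged sketch of a Hessian-vector product plus a few Lanczos steps on a toy
# model. The Ritz values of the tridiagonal matrix T approximate extreme
# Hessian eigenvalues; SLQ additionally averages over random probe vectors.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 1)                      # toy "parameter block"
x, y = torch.randn(64, 20), torch.randn(64, 1)      # toy data (assumption)
params = [p for p in model.parameters() if p.requires_grad]
loss = torch.nn.functional.mse_loss(model(x), y)
grads = torch.autograd.grad(loss, params, create_graph=True)

def hvp(vecs):
    """Hessian-vector product via double backprop (one tensor per parameter)."""
    dot = sum((g * v).sum() for g, v in zip(grads, vecs))
    return torch.autograd.grad(dot, params, retain_graph=True)

def lanczos(num_steps=10):
    """Run a few Lanczos steps starting from a random unit vector."""
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((u * u).sum() for u in v))
    v = [u / norm for u in v]
    v_prev = [torch.zeros_like(u) for u in v]
    alphas, betas, beta = [], [], 0.0
    for _ in range(num_steps):
        w = [wi - beta * vp for wi, vp in zip(hvp(v), v_prev)]
        alpha = sum((wi * vi).sum() for wi, vi in zip(w, v))
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
        beta = torch.sqrt(sum((wi * wi).sum() for wi in w))
        alphas.append(alpha.item())
        betas.append(beta.item())
        v_prev, v = v, [wi / (beta + 1e-12) for wi in w]
    T = (torch.diag(torch.tensor(alphas))
         + torch.diag(torch.tensor(betas[:-1]), 1)
         + torch.diag(torch.tensor(betas[:-1]), -1))
    return torch.linalg.eigvalsh(T)                 # Ritz values

print("Approximate Hessian eigenvalues:", lanczos())
```

Full SLQ repeats this with several random probe vectors and combines the resulting Ritz values and weights into a spectral density estimate, which is what the paper reports per parameter block.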
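
Finally, the tuning protocol quoted in the "Experiment Setup" row, grid-searching the learning rate for each optimizer under the same budget with a cosine-decay schedule, can be sketched as follows; the grid values, toy model, and step count are placeholders rather than the paper's configuration.

```python
# Hedged sketch of per-optimizer learning-rate grid search with cosine decay.
# Grid, model, data, and budget are placeholders, not the paper's settings.
import torch

def final_loss(opt_name, lr, steps=100):
    torch.manual_seed(0)
    model = torch.nn.Linear(20, 1)
    x, y = torch.randn(1024, 20), torch.randn(1024, 1)
    if opt_name == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    else:
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
        sched.step()                     # cosine-decay learning-rate schedule
    return loss.item()

grid = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]   # assumed search grid
for opt_name in ("sgd", "adamw"):
    loss, lr = min((final_loss(opt_name, lr), lr) for lr in grid)
    print(f"{opt_name}: best final loss {loss:.4f} at peak lr {lr}")
```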