Why Transformers Need Adam: A Hessian Perspective

Authors: Yushun Zhang, Congliang Chen, Tian Ding, Ziniu Li, Ruoyu Sun, Zhi-Quan Luo

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we provide an explanation through the lens of the Hessian: (i) Transformers are heterogeneous: the Hessian spectrum across parameter blocks varies dramatically, a phenomenon we call "block heterogeneity"; (ii) heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies a single learning rate to all blocks, which cannot handle the heterogeneity among blocks. (A toy sketch after the table illustrates this single-learning-rate limitation.)
Researcher Affiliation | Academia | (1) The Chinese University of Hong Kong, Shenzhen, China; (2) Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen, China; (3) Shenzhen Research Institute of Big Data, Shenzhen, China
Pseudocode | Yes | Algorithm 1: Stochastic Gradient Descent with Momentum (SGD); Algorithm 2: AdamW; Algorithm 3: Adam with no bias correction; Algorithm 4: The Lanczos Algorithm; Algorithm 5: The Stochastic Lanczos Quadrature Method. (A minimal Lanczos/SLQ sketch follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/zyushun/hessian-spectrum.
Open Datasets | Yes | CNNs: We study ResNet18 (11M) and VGG16 (138M) on ImageNet [40, 78]. ... Transformers: We study Transformers at various scales and modalities, including GPT2 (125M) on OpenWebText [71]; ViT-base (86M) on ImageNet [27]; BERT (40M) on the Cornell Movie-Dialogs Corpus [25]; and GPT2-nano (11M) on an English corpus.
Dataset Splits | No | The paper mentions various datasets used for training but does not explicitly state the dataset splits (e.g., percentages or sample counts for training, validation, and test sets). It implies the use of default configurations from existing codebases, but these are not specified within the paper's text.
Hardware Specification | Yes | The result is tested on a single V100. ... In total, it takes about 7 days on one A100 GPU to estimate all the blockwise Hessian spectra and the full Hessian spectrum.
Software Dependencies | No | The paper mentions using a "simple PyTorch implementation of SLQ" but does not specify version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | We use batch size = 1024. ... We use batch size = 327,680 tokens. ... We use batch size = 163,840 tokens. ... We use batch size = 245,760 tokens. ... We grid-search the learning rates for SGD and Adam under the same budget and report the best result for each optimizer. We use the cosine-decay learning rate schedule for vision tasks. (A schematic grid-search and cosine-decay training loop is sketched after the table.)
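To make the single-learning-rate argument from the Research Type row concrete, here is a hedged NumPy toy that is not taken from the paper or its codebase: on a diagonal quadratic whose two parameter blocks have curvatures differing by roughly 1000x, any single learning rate that is stable on the sharp block makes slow progress on the flat block, while per-block step sizes (a crude stand-in for Adam's coordinate-wise rescaling) converge on both. All curvature values, step sizes, and iteration counts below are invented for illustration.

```python
import numpy as np

# Toy diagonal quadratic f(x) = 0.5 * sum_i h_i * x_i^2 with two parameter
# "blocks" whose curvatures differ by ~1000x, mimicking block heterogeneity.
h = np.array([1000.0, 500.0,   # block 1: sharp spectrum
              1.0,    0.5])    # block 2: flat spectrum
block_of = np.array([0, 0, 1, 1])

def run_single_lr(lr, steps=300):
    """Gradient descent with one learning rate shared by all blocks."""
    x = np.ones_like(h)
    for _ in range(steps):
        x -= lr * (h * x)
    return 0.5 * np.sum(h * x**2)

def run_blockwise(steps=300):
    """One step size per block, set from that block's own largest curvature,
    standing in for Adam's per-coordinate rescaling."""
    lr_block = np.array([1.0 / h[block_of == b].max() for b in (0, 1)])
    x = np.ones_like(h)
    for _ in range(steps):
        x -= lr_block[block_of] * (h * x)
    return 0.5 * np.sum(h * x**2)

# Grid-search the single learning rate (stability requires lr < 2/1000 here).
best_single = min(run_single_lr(lr) for lr in np.logspace(-5, np.log10(1.9e-3), 20))
print("best single-lr GD loss:", best_single)      # the flat block decays slowly
print("blockwise-lr loss     :", run_blockwise())  # essentially zero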
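Algorithms 4 and 5 in the Pseudocode row (Lanczos and SLQ) are the tools the paper uses to estimate Hessian spectra via Hessian-vector products. Below is a minimal PyTorch sketch of the Lanczos core in that spirit; the function names, the omission of reorthogonalization, and the defaults are our own simplifications, not the released implementation.

```python
import torch

def hvp(loss_fn, params, vec):
    """Hessian-vector product via double backprop (Pearlmutter's trick)."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(torch.dot(flat_grad, vec), params)
    return torch.cat([h.reshape(-1) for h in hv])

def lanczos_spectrum(loss_fn, params, num_steps=30):
    """Run num_steps of Lanczos; the eigenvalues of the tridiagonal matrix T
    (Ritz values) approximate part of the Hessian spectrum. SLQ repeats this
    with several random start vectors and averages the spectral densities."""
    n = sum(p.numel() for p in params)
    v = torch.randn(n)
    v /= v.norm()
    v_prev = torch.zeros(n)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(num_steps):
        w = hvp(loss_fn, params, v) - beta * v_prev
        alpha = torch.dot(w, v)
        w = w - alpha * v                 # no full reorthogonalization here
        beta = w.norm()
        alphas.append(alpha.item())
        betas.append(beta.item())
        if beta < 1e-8:                   # invariant subspace found; stop early
            break
        v_prev, v = v, w / beta
    T = torch.diag(torch.tensor(alphas))
    off = torch.tensor(betas[:-1])
    T = T + torch.diag(off, 1) + torch.diag(off, -1)
    return torch.linalg.eigvalsh(T)
```

Passing only one parameter block's tensors as `params` restricts the Hessian-vector product to that block, which is one way the blockwise spectra described in the paper can be probed.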
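The Experiment Setup row mentions grid-searching learning rates under the same budget and a cosine-decay schedule. The loop below is a schematic PyTorch rendering of that protocol on a throwaway synthetic task; the model, data, grid values, and epoch budget are placeholders, not the paper's configuration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Throwaway synthetic classification task standing in for the real workloads.
torch.manual_seed(0)
X, y = torch.randn(512, 32), torch.randint(0, 4, (512,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=64)

def train_once(optimizer_name, lr, epochs=5):
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
    if optimizer_name == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    else:
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
    # Cosine-decay learning-rate schedule over the full training budget.
    sched = CosineAnnealingLR(opt, T_max=epochs * len(loader))
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()
            sched.step()
    return loss.item()  # final mini-batch loss as a stand-in metric

# Grid-search learning rates under the same budget for each optimizer
# and keep the best run for each, as the setup describes.
grid = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1]
best = {name: min(train_once(name, lr) for lr in grid)
        for name in ("sgd", "adamw")}
print(best)
```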