Why are Adaptive Methods Good for Attention Models?

Authors: Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, Suvrit Sra

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide a positive answer to the above question by performing both theoretical and empirical studies of the convergence of optimization methods under heavy-tailed noise.
Researcher Affiliation | Collaboration | Jingzhao Zhang (MIT, jzhzhang@mit.edu); Sai Praneeth Karimireddy (EPFL, sai.karimireddy@epfl.ch); Andreas Veit (Google Research, aveit@google.com); Seungyeon Kim (Google Research, seungyeonk@google.com); Sashank Reddi (Google Research, sashank@google.com); Sanjiv Kumar (Google Research, sanjivk@google.com); Suvrit Sra (MIT, suvrit@mit.edu)
Pseudocode | Yes | Algorithm 1 ACClip (an illustrative sketch of such an update is given after this table)
Open Source Code | No | The paper adapts a third-party GitHub repository ('https://github.com/kimiyoung/transformer-xl/tree/master/pytorch') for some experiments, but it does not provide an explicit statement or link for the source code of their proposed ACClip algorithm.
Open Datasets | Yes | We first investigate the distribution of the gradient noise norm ∥g − ∇f(x)∥ in the aforementioned neural network models, where g is the stochastic gradient computed from a minibatch sample. In particular, we focus on noise distributions while training two popular deep learning models, BERT and ResNet. Note that BERT and ResNet are typically trained with Adam and SGD (with momentum) respectively, and can thus provide insights about the difference between these optimizers. We train a 6-layer Transformer-XL model [8] on the PTB dataset as a proof of concept. We now evaluate the empirical performance of our proposed ACClip algorithm on BERT pre-training as well as fine-tuning using the SQuAD v1.1 dataset. (A sketch of how this noise norm can be measured follows the table.)
Dataset Splits | Yes | The learning rates and hyperparameters for each method have been extensively tuned to provide the best performance on the validation set. We again follow the procedure outlined in [9] and present the results on the Dev set in Table 3.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications, and it does not mention any specific hardware setup used for running the experiments.
Software Dependencies | No | The paper mentions software like 'pytorch' in a GitHub link and refers to the 'Adam optimizer' and 'BERT paper', implying the use of their respective frameworks (likely TensorFlow), but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | For ACClip, we set τ = 1, learning rate = 1e-4, β1 = 0.9, β2 = 0.99, ϵ = 1e-5 and weight decay = 1e-5. (These values are reused in the optimizer sketch below.)
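
To make the gradient noise norm ∥g − ∇f(x)∥ quoted in the Open Datasets row concrete, here is a minimal PyTorch sketch of one way to estimate it per minibatch. This is not code from the paper: the helper names (`flat_grad`, `gradient_noise_norms`) are hypothetical, and the full-batch gradient ∇f(x) is approximated by the average of the sampled minibatch gradients.

```python
import torch

def flat_grad(model, loss):
    """Gradient of `loss` w.r.t. all trainable parameters, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_noise_norms(model, loss_fn, batches):
    """Return ||g_i - g_bar|| for each minibatch gradient g_i, where g_bar
    (the mean of the sampled minibatch gradients) stands in for grad f(x)."""
    per_batch = []
    for inputs, targets in batches:
        loss = loss_fn(model(inputs), targets)
        per_batch.append(flat_grad(model, loss))
    g_bar = torch.stack(per_batch).mean(dim=0)  # proxy for the full-batch gradient
    return [(g - g_bar).norm().item() for g in per_batch]
```

A histogram of the returned norms is the kind of plot one would inspect for heavy tails when comparing models such as BERT and ResNet.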
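
For the Pseudocode row, the following is an illustrative sketch of an adaptive coordinate-wise clipping update in the spirit of ACClip, not a reproduction of the paper's exact Algorithm 1: it assumes an Adam-style first-moment estimate m, an exponential moving average b of |g| that sets the clipping scale, and a coordinate-wise clipping factor min(τ·b / (|m| + ϵ), 1); bias correction is omitted. The class name `ACClipSketch` and the precise update rule are assumptions for illustration only.

```python
import torch

class ACClipSketch:
    """Illustrative adaptive-clipping update; NOT the paper's exact Algorithm 1."""
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), tau=1.0,
                 eps=1e-5, weight_decay=1e-5):
        self.params = [p for p in params if p.requires_grad]
        self.lr, (self.b1, self.b2) = lr, betas
        self.tau, self.eps, self.wd = tau, eps, weight_decay
        self.m = [torch.zeros_like(p) for p in self.params]  # EMA of gradients (momentum)
        self.b = [torch.zeros_like(p) for p in self.params]  # EMA of |g|: adaptive clipping scale

    @torch.no_grad()
    def step(self):
        for p, m, b in zip(self.params, self.m, self.b):
            if p.grad is None:
                continue
            g = p.grad
            m.mul_(self.b1).add_(g, alpha=1.0 - self.b1)        # first moment
            b.mul_(self.b2).add_(g.abs(), alpha=1.0 - self.b2)  # clipping scale
            # Coordinate-wise clipping: shrink momentum entries that exceed tau * b.
            clip = torch.clamp(self.tau * b / (m.abs() + self.eps), max=1.0)
            p.add_(clip * m + self.wd * p, alpha=-self.lr)
```

With the hyperparameters from the Experiment Setup row, this sketch would be instantiated as `ACClipSketch(model.parameters(), lr=1e-4, betas=(0.9, 0.99), tau=1.0, eps=1e-5, weight_decay=1e-5)` for some `model`.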