The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers

Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small k brings a collection of desired properties, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence. (See the Top-k thresholding sketch after the table.)
Researcher Affiliation | Industry | Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, and Sanjiv Kumar. Google Research, New York City, USA. {lizonglin,cyou,bsrinadh,daliangli,ankitsrawat}@google.com {sashank,kkye,fchern,felixyu,guorq,sanjivk}@google.com
Pseudocode | No | The paper does not contain pseudocode or algorithm blocks.
Open Source Code | No | Our experiment with T5 and ViT uses the T5X (Roberts et al., 2022) and the Scenic codebase (Dehghani et al., 2022), respectively.
Open Datasets | Yes | T5 is an encoder-decoder model for natural language processing tasks (Raffel et al., 2020). We train T5 on the Colossal Clean Crawled Corpus (C4) using the span corruption task. ViT is an encoder model for vision tasks (Dosovitskiy et al., 2021). Unless specified otherwise, we train ViT on ImageNet-21k (Deng et al., 2009), an image classification dataset with 14M images and 21k classes. For certain cases we also use ImageNet-1k, which is a subset of ImageNet-21k with 1.3M images and 1k classes.
Dataset Splits | No | The paper states that it uses C4 and ImageNet-21k/1k, and mentions evaluation on training and evaluation data, but does not provide specific train/validation/test splits (e.g., percentages or counts) or reference standard splits with citations.
Hardware Specification | Yes | We provide proof-of-concept results on wall time reduction for the task of unbatched decoding on TPUv4 with a large Top-k T5.
Software Dependencies | No | Our experiment with T5 and ViT uses the T5X (Roberts et al., 2022) and the Scenic codebase (Dehghani et al., 2022), respectively. We use the implementation of jax.lax.approx_max_k (Chern et al., 2022) with a recall target of 0.95. For most of the experiments, except the Top-k Transformer, we used the vanilla T5 architecture (Raffel et al., 2020). We trained the model with the Adafactor optimizer. Following Dosovitskiy et al. (2021), we train ViT using ADAM (Kingma & Ba, 2015) as the optimizer. We evaluate the sparsity level of BERT models (Devlin et al., 2019). We specifically consider BERT Base (12 layers) and BERT Large (24 layers) Transformer models, with ReLU activation in the MLP layers. We follow the same training recipe as Devlin et al. (2019) and pre-train these models on the Wikipedia and Books datasets using the Masked Language Modelling (MLM) objective. We train for 450,000 steps with a batch size of 1024 using the AdamW optimizer with a 1e-4 learning rate. We evaluate the sparsity level of the MLP-Mixer (Tolstikhin et al., 2021), an all-MLP architecture constructed from cascading token-mixing and channel-mixing MLPs. Specifically, we use Mixer-B16 as the architecture, ADAM with β1 = 0.9, β2 = 0.999 as the optimizer, and train on ImageNet-21k for 300 epochs. (See the approx_max_k sketch after the table.)
Experiment Setup | Yes | For most of the experiments, except the Top-k Transformer, we used the vanilla T5 architecture (Raffel et al., 2020). We trained the model with the Adafactor optimizer, an inverse square root learning rate schedule, and no dropout. For the first 10,000 steps we also use a fixed learning rate of 0.01 as warm-up. The training task is span corruption without any mixture, and unless specified otherwise, we train the model for 100,000 steps with a batch size of 256 to save compute and time, as the sparsity or accuracy trend is already clear by then. We used 512 tokens on the encoder side and 114 tokens on the decoder side. Following Dosovitskiy et al. (2021), we train ViT using ADAM (Kingma & Ba, 2015) as the optimizer with β1 = 0.9, β2 = 0.999. Other training details such as weight decay, dropout rate, and learning rate all follow the description in (Dosovitskiy et al., 2021, Section B.1), except that we train for 180 epochs (as opposed to 300) on ImageNet-1k. (See the learning-rate schedule sketch after the table.)
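
The Top-k thresholding highlighted in the Research Type row keeps only the k largest entries of each token's MLP activations and zeroes the rest. Below is a minimal JAX sketch of that idea; the function name, shapes, and the choice of k are illustrative assumptions, not the authors' implementation.

```python
import jax
import jax.numpy as jnp

def topk_activation(h: jnp.ndarray, k: int) -> jnp.ndarray:
    """Keep each token's k largest activations and zero out the rest."""
    # jax.lax.top_k works along the last axis; the k-th largest value per
    # token serves as a threshold below which activations are dropped.
    kth_value = jax.lax.top_k(h, k)[0][..., -1:]
    return jnp.where(h >= kth_value, h, 0.0)

# Toy example: 2 tokens with a 3072-wide MLP hidden layer, keeping k = 128.
h = jax.random.normal(jax.random.PRNGKey(0), (2, 3072))
sparse_h = topk_activation(h, k=128)  # 128 nonzeros per token (barring ties)
```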
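
The Software Dependencies row cites jax.lax.approx_max_k with a recall target of 0.95. The sketch below shows one way that primitive could be used to build a sparse activation tensor; scattering the retained values back into a dense array is an assumption made here for illustration, not necessarily how the paper's wall-time experiment is implemented.

```python
import jax
import jax.numpy as jnp

h = jax.random.normal(jax.random.PRNGKey(0), (2, 3072))  # toy activations

# approx_max_k returns the (approximately) k largest values and their indices
# along the last axis; recall_target trades exactness for TPU-friendly speed.
values, indices = jax.lax.approx_max_k(h, k=128, recall_target=0.95)

# Scatter the retained values back into an otherwise-zero activation tensor.
rows = jnp.arange(h.shape[0])[:, None]
sparse_h = jnp.zeros_like(h).at[rows, indices].set(values)
```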
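
The Experiment Setup row describes an inverse square root learning rate schedule with a fixed 0.01 warm-up for the first 10,000 steps. The following is a minimal sketch consistent with that description (the usual T5-style rule lr = 1/sqrt(max(step, warmup_steps))); the exact T5X implementation may differ in details.

```python
import math

def inverse_sqrt_lr(step: int, warmup_steps: int = 10_000) -> float:
    """Constant 1/sqrt(warmup_steps) = 0.01 during warm-up, then 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warmup_steps))

print(inverse_sqrt_lr(1))       # 0.01 (warm-up phase)
print(inverse_sqrt_lr(40_000))  # 0.005
```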