ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models

Authors: Seyed Iman Mirzadeh, Keivan Alizadeh-Vahid, Sachin Mehta, Carlo C del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that using the ReLU activation function has a negligible impact on convergence and performance while significantly reducing computation and weight transfer. This reduction is particularly valuable during the memory-bound inference step, where efficiency is paramount. Exploring sparsity patterns in ReLU-based LLMs, we unveil the reutilization of activated neurons for generating new tokens, and, leveraging these insights, we propose practical strategies to substantially reduce LLM inference computation up to three times, using ReLU activations with minimal performance trade-offs. (See the sparsity-measurement sketch after the table.)
Researcher Affiliation | Industry | Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar (Apple)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We use the RefinedWeb dataset (Penedo et al., 2023) for our pretraining in Sec. 3.2 and finetuning pretrained models in Sec. 4. We chose RefinedWeb because it is a high-quality subset of Common Crawl, which is often used in the pretraining phase of LLMs, including Llama, Falcon, and OPT. We also use the validation split of WikiText (Merity et al., 2017) for measuring the sparsity and recording preactivation distributions of various pretrained models. (See the data-loading sketch after the table.)
Dataset Splits | Yes | We also use the validation split of WikiText (Merity et al., 2017) for measuring the sparsity and recording preactivation distributions of various pretrained models.
Hardware Specification | Yes | Overall, as depicted in Figure 9b based on the calculations by Liu et al. (2023b), we demonstrate that for the OPT model on an NVIDIA A100 node, counting FLOPs provides a reasonable approximation to, and is highly correlated with, the time needed to generate tokens, especially for LLMs with activation sparsity. [...] We measured the average latency of our 16-bit GeMV kernel with vector and matrix dimensions of 8192 on a MacBook Pro equipped with an Apple M2 Pro chip. (See the latency-benchmark sketch after the table.)
Software Dependencies | No | The paper mentions software tools like "cuSPARSE on NVIDIA CUDA" and "Accelerate on Apple devices" but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For finetuning the pretrained models, we follow the original pretraining recipe, except we use a fixed learning rate of 1.5e-5 for Llama 7B, Falcon 7B, and OPT 6.7B models. In addition, we use the AdamW optimizer (Loshchilov & Hutter, 2019) for our finetuning with ZeRO stage 1 (Rajbhandari et al., 2020), where we shard the optimizer states across different GPUs. For pretraining OPT 1.3B models from scratch in Sec. 3.2, we follow the OPT training recipe. (See the optimizer-setup sketch after the table.)
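
The activation-sparsity idea in the Research Type row can be made concrete with a small measurement. The sketch below is illustrative only, assuming PyTorch; the tensor shapes and function names are placeholders, not the authors' code.

```python
# Minimal sketch: measure the fraction of FFN activations that ReLU zeroes out.
# Assumes PyTorch; shapes and names below are illustrative, not from the paper's code.
import torch
import torch.nn.functional as F

def relu_activation_sparsity(pre_activations: torch.Tensor) -> float:
    """Fraction of post-ReLU entries that are exactly zero."""
    post = F.relu(pre_activations)
    return (post == 0).float().mean().item()

# Hypothetical pre-activations from one FFN layer: (batch, tokens, ffn_dim).
pre = torch.randn(4, 128, 11008)
print(f"activation sparsity: {relu_activation_sparsity(pre):.1%}")
```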
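
For the datasets in the Open Datasets and Dataset Splits rows, a hedged loading sketch using the Hugging Face `datasets` library is shown below; the Hub identifiers (`tiiuae/falcon-refinedweb`, `wikitext`/`wikitext-103-raw-v1`) are the commonly used public copies and may not match the authors' exact data pipeline.

```python
# Hedged sketch of loading RefinedWeb and the WikiText validation split with
# the Hugging Face `datasets` library. Dataset IDs are the usual Hub names,
# not necessarily the authors' exact copies.
from datasets import load_dataset

# RefinedWeb (pretraining/finetuning corpus); streaming avoids a full download.
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# WikiText validation split, used for sparsity and preactivation measurements.
wikitext_val = load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")

print(next(iter(refinedweb))["content"][:200])  # "content" is the text field on the Hub copy
print(wikitext_val[0]["text"][:200])
```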
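
The latency measurement described in the Hardware Specification row can be approximated with stock PyTorch rather than the authors' custom kernel; the sketch below times a 16-bit matrix-vector product at dimension 8192 on Apple's MPS backend when available. The iteration counts and the CPU dtype fallback are assumptions.

```python
# Rough GeMV latency benchmark at dimension 8192, loosely mirroring the
# measurement described in the paper. Uses stock PyTorch, not the authors'
# custom 16-bit kernel; iteration counts and dtype fallback are assumptions.
import time
import torch

dim = 8192
device = "mps" if torch.backends.mps.is_available() else "cpu"
dtype = torch.float16 if device == "mps" else torch.float32  # CPU fp16 GEMM support varies
W = torch.randn(dim, dim, dtype=dtype, device=device)
x = torch.randn(dim, dtype=dtype, device=device)

for _ in range(10):          # warm-up
    _ = W @ x
n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    _ = W @ x
if device == "mps":
    torch.mps.synchronize()  # flush queued GPU work before stopping the clock
elapsed = time.perf_counter() - start
print(f"avg GeMV latency: {elapsed / n_iters * 1e3:.3f} ms on {device}")
```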
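
Finally, the finetuning recipe in the Experiment Setup row (fixed 1.5e-5 learning rate, AdamW, ZeRO stage 1) roughly corresponds to an optimizer and DeepSpeed configuration like the sketch below; the model stand-in, batch size, and precision flag are assumptions, not values stated in the paper.

```python
# Illustrative sketch of the finetuning optimizer setup: AdamW at a fixed
# 1.5e-5 learning rate, with ZeRO stage 1 sharding of optimizer states
# expressed as a DeepSpeed-style config. Model stand-in, batch size, and
# precision are assumptions, not the authors' released settings.
import torch

model = torch.nn.Linear(4096, 4096)  # stand-in for Llama 7B / Falcon 7B / OPT 6.7B weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-5)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,                       # assumed, not stated in the paper
    "optimizer": {"type": "AdamW", "params": {"lr": 1.5e-5}},
    "zero_optimization": {"stage": 1},                          # shard optimizer states across GPUs
    "bf16": {"enabled": True},                                  # assumed precision
}
```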