SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Authors: Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while also maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs.
Researcher Affiliation | Collaboration | 1 Seoul National University, 2 SqueezeBits Inc., 3 Sungkyunkwan University.
Pseudocode | Yes | Algorithm 1 SLEB algorithm. We remove the transformer blocks until the number of removed blocks reaches the target number. (A hedged sketch of this loop follows the table.)
Open Source Code | Yes | The code is available at: https://github.com/jiwonsong-dev/SLEB.
Open Datasets | Yes | We use 128 samples randomly selected from the WikiText-2 training dataset as calibration data, following the approach used in previous works (Ashkboos et al., 2024).
Dataset Splits | Yes | We use 128 samples randomly selected from the WikiText-2 training dataset as calibration data, following the approach used in previous works (Ashkboos et al., 2024).
Hardware Specification | Yes | The experiments on redundancy verification and elimination of transformer blocks are executed on NVIDIA A100 GPUs equipped with 80 GB of memory.
Software Dependencies | No | We implement SLEB in PyTorch (Paszke et al., 2019), using the Hugging Face Transformers library (Wolf et al., 2020). While specific software components are named, the explicit version numbers needed for reproducibility (e.g., PyTorch 1.9 or Transformers 4.2.0) are not provided in the text.
Experiment Setup | Yes | We use 128 samples randomly selected from the WikiText-2 training dataset as calibration data... Our evaluation encompasses models from the OPT and LLaMA-2 families. We assess SLEB under two target sparsity levels: 10% and 20%... For token generation, the test scenario consists of generating sentences with a length of 128 tokens and a batch size of 64. For prompt processing, we measure the latency when processing an input sequence with 2048 tokens. (A latency-measurement sketch for these two scenarios follows below.)
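To make the block-removal loop referenced in the Pseudocode row concrete, below is a minimal PyTorch sketch. It is not the authors' implementation (see the linked repository for that): it assumes a decoder-only Hugging Face causal LM whose transformer blocks are exposed as an nn.ModuleList (e.g., model.model.decoder.layers for OPT or model.model.layers for LLaMA-2), and it scores candidate removals with plain calibration loss, whereas the paper's exact scoring metric may differ. The names sleb_prune, calib_loss, and get_layers are illustrative.

```python
# Minimal sketch of a greedy block-removal loop in the spirit of Algorithm 1.
# Assumes a decoder-only Hugging Face causal LM whose transformer blocks live
# in an nn.ModuleList. The removal score here is plain calibration loss; the
# paper's exact metric may differ. All helper names are illustrative.
import torch


@torch.no_grad()
def calib_loss(model, calib_batches):
    """Average next-token loss over the calibration batches (lower is better)."""
    model.eval()
    total = 0.0
    for input_ids in calib_batches:
        total += model(input_ids=input_ids, labels=input_ids).loss.item()
    return total / len(calib_batches)


def sleb_prune(model, get_layers, calib_batches, num_remove):
    """Greedily drop `num_remove` transformer blocks, one block per iteration."""
    for _ in range(num_remove):
        layers = get_layers(model)           # e.g. model.model.decoder.layers (OPT)
        best_idx, best_loss = None, float("inf")
        for i in range(len(layers)):
            candidate = layers[i]
            del layers[i]                    # temporarily remove block i
            loss = calib_loss(model, calib_batches)
            layers.insert(i, candidate)      # restore it before trying the next block
            if loss < best_loss:
                best_idx, best_loss = i, loss
        del get_layers(model)[best_idx]      # permanently remove the best candidate
    return model
```

Under the setup quoted above, calib_batches would hold 128 sequences drawn from the WikiText-2 training split, and num_remove would correspond to 10% or 20% of the model's block count.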
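The two latency scenarios quoted in the Experiment Setup row (prompt processing of a 2048-token input, and generation of 128 tokens at batch size 64) can be timed roughly as in the sketch below. This is not the authors' benchmarking code: it uses CUDA events and random token IDs as dummy inputs, and the warmup/iteration counts and the measure_ms/benchmark names are arbitrary choices made here.

```python
# Rough sketch of the two latency scenarios from the setup row. Assumes a
# CUDA-resident Hugging Face causal LM; warmup and iteration counts are
# arbitrary and not taken from the paper.
import torch


def measure_ms(fn, warmup=3, iters=10):
    """Average wall-clock time of fn() in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


@torch.no_grad()
def benchmark(model, vocab_size, device="cuda"):
    model.eval()
    # Prompt processing: one forward pass over a 2048-token input sequence.
    prompt = torch.randint(0, vocab_size, (1, 2048), device=device)
    prompt_ms = measure_ms(lambda: model(input_ids=prompt))
    # Token generation: 128 new tokens at batch size 64, greedy decoding.
    ctx = torch.randint(0, vocab_size, (64, 1), device=device)
    gen_ms = measure_ms(lambda: model.generate(
        ctx, max_new_tokens=128, min_new_tokens=128, do_sample=False))
    return prompt_ms, gen_ms
```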