SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks
Authors: Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while also maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs. |
| Researcher Affiliation | Collaboration | 1Seoul National University, 2SqueezeBits Inc., 3Sungkyunkwan University. |
| Pseudocode | Yes | Algorithm 1 SLEB algorithm. We remove the transformer blocks until the number of removed blocks reaches the target number. (See the block-removal sketch after this table.) |
| Open Source Code | Yes | The code is available at: https://github.com/jiwonsong-dev/SLEB. |
| Open Datasets | Yes | We use 128 samples randomly selected from WikiText-2 training dataset as calibration data, following the approach used in previous works (Ashkboos et al., 2024). |
| Dataset Splits | Yes | We use 128 samples randomly selected from WikiText-2 training dataset as calibration data, following the approach used in previous works (Ashkboos et al., 2024). |
| Hardware Specification | Yes | The experiments on redundancy verification and elimination of transformer blocks are executed on NVIDIA A100 GPUs equipped with 80GB of memory. |
| Software Dependencies | No | We implement SLEB in PyTorch (Paszke et al., 2019), using the Hugging Face Transformers library (Wolf et al., 2020). While specific software components are named, explicit version numbers for reproducibility (e.g., PyTorch 1.9 or Transformers 4.2.0) are not provided in the text. |
| Experiment Setup | Yes | We use 128 samples randomly selected from WikiText-2 training dataset as calibration data... Our evaluation encompasses models from the OPT and LLaMA-2 families. We assess SLEB under two target sparsity levels: 10% and 20%... For token generation, the test scenario consists of generating sentences with a length of 128 tokens and a batch size of 64. For prompt processing, we measure the latency when processing an input sequence with 2048 tokens. (See the latency-measurement sketch after this table.) |
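
The Pseudocode row above summarizes Algorithm 1 as a greedy loop that keeps removing transformer blocks until the target count is reached. Below is a minimal sketch of that loop, assuming a Hugging Face LLaMA-style model whose blocks live in `model.model.layers` and an illustrative `score_fn` (e.g., calibration perplexity, lower is better) standing in for the paper's redundancy metric; neither the attribute path nor the metric is taken from the authors' code.

```python
import torch

def sleb_prune(model, calib_batches, num_blocks_to_remove, score_fn):
    """Greedy transformer-block removal in the spirit of SLEB's Algorithm 1.

    Illustrative assumptions (not the authors' implementation):
      * `model.model.layers` holds the transformer blocks, as in Hugging Face
        LLaMA-style models (OPT models keep them in `model.model.decoder.layers`).
      * `score_fn(model, calib_batches)` returns a quality score such as
        calibration perplexity; lower is better.
    """
    removed = []
    for _ in range(num_blocks_to_remove):
        blocks = model.model.layers
        best_idx, best_score = None, float("inf")
        for idx in range(len(blocks)):
            # Tentatively skip block `idx` and score the shortened model.
            model.model.layers = torch.nn.ModuleList(
                b for i, b in enumerate(blocks) if i != idx
            )
            score = score_fn(model, calib_batches)
            if score < best_score:
                best_idx, best_score = idx, score
            model.model.layers = blocks  # restore before the next trial
        # Permanently drop the block whose removal hurt the score the least.
        model.model.layers = torch.nn.ModuleList(
            b for i, b in enumerate(blocks) if i != best_idx
        )
        removed.append(best_idx)
    return removed
```

Indices in `removed` are relative to the block list as it shrinks, so they record the order of removal rather than positions in the original model.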
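
The Experiment Setup row quotes two latency scenarios: generating 128 tokens at batch size 64, and processing a 2048-token prompt. A minimal timing sketch for both is given below, assuming a Hugging Face causal LM on a single GPU; the model name, prompt contents, and greedy decoding settings are placeholders rather than the paper's exact benchmark harness.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; the paper evaluates OPT and LLaMA-2 families
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
model.eval()

# Token generation: batch of 64 prompts, 128 newly generated tokens each.
prompts = ["Hello, my name is"] * 64
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
print(f"token generation latency: {time.time() - start:.3f} s")

# Prompt processing: one forward pass over a 2048-token input sequence.
long_input = torch.randint(0, tokenizer.vocab_size, (1, 2048), device="cuda")
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    model(long_input)
torch.cuda.synchronize()
print(f"prompt processing latency: {time.time() - start:.3f} s")
```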