SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Authors: Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while also maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs.
Researcher Affiliation | Collaboration | 1 Seoul National University, 2 SqueezeBits Inc., 3 Sungkyunkwan University.
Pseudocode | Yes | Algorithm 1 SLEB algorithm. We remove the transformer blocks until the number of removed blocks reaches the target number. (A hedged sketch of this loop follows the table.)
Open Source Code | Yes | The code is available at: https://github.com/jiwonsong-dev/SLEB.
Open Datasets | Yes | We use 128 samples randomly selected from the WikiText-2 training dataset as calibration data, following the approach used in previous works (Ashkboos et al., 2024).
Dataset Splits | Yes | We use 128 samples randomly selected from the WikiText-2 training dataset as calibration data, following the approach used in previous works (Ashkboos et al., 2024).
Hardware Specification | Yes | The experiments on redundancy verification and elimination of transformer blocks are executed on NVIDIA A100 GPUs equipped with 80 GB of memory.
Software Dependencies | No | We implement SLEB in PyTorch (Paszke et al., 2019), using the Hugging Face Transformers library (Wolf et al., 2020). While specific software components are named, the explicit version numbers needed for reproducibility (e.g., PyTorch 1.9 or Transformers 4.2.0) are not provided in the text.
Experiment Setup | Yes | We use 128 samples randomly selected from the WikiText-2 training dataset as calibration data... Our evaluation encompasses models from the OPT and LLaMA-2 families. We assess SLEB under two target sparsity levels: 10% and 20%... For token generation, the test scenario consists of generating sentences with a length of 128 tokens and a batch size of 64. For prompt processing, we measure the latency when processing an input sequence with 2048 tokens. (A latency-measurement sketch for these two scenarios follows below.)
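To make the block-removal loop referenced in the Pseudocode row concrete, below is a minimal PyTorch sketch. It is not the authors' implementation (see the linked repository for that): it assumes a decoder-only Hugging Face causal LM whose transformer blocks are exposed as an nn.ModuleList (e.g., model.model.decoder.layers for OPT or model.model.layers for LLaMA-2), and it scores candidate removals with plain calibration loss, whereas the paper's exact scoring metric may differ. The names sleb_prune, calib_loss, and get_layers are illustrative.

```python
# Minimal sketch of a greedy block-removal loop in the spirit of Algorithm 1.
# Assumes a decoder-only Hugging Face causal LM whose transformer blocks live
# in an nn.ModuleList. The removal score here is plain calibration loss; the
# paper's exact metric may differ. All helper names are illustrative.
import torch


@torch.no_grad()
def calib_loss(model, calib_batches):
    """Average next-token loss over the calibration batches (lower is better)."""
    model.eval()
    total = 0.0
    for input_ids in calib_batches:
        total += model(input_ids=input_ids, labels=input_ids).loss.item()
    return total / len(calib_batches)


def sleb_prune(model, get_layers, calib_batches, num_remove):
    """Greedily drop `num_remove` transformer blocks, one block per iteration."""
    for _ in range(num_remove):
        layers = get_layers(model)           # e.g. model.model.decoder.layers (OPT)
        best_idx, best_loss = None, float("inf")
        for i in range(len(layers)):
            candidate = layers[i]
            del layers[i]                    # temporarily remove block i
            loss = calib_loss(model, calib_batches)
            layers.insert(i, candidate)      # restore it before trying the next block
            if loss < best_loss:
                best_idx, best_loss = i, loss
        del get_layers(model)[best_idx]      # permanently remove the best candidate
    return model
```

Under the setup quoted above, calib_batches would hold 128 sequences drawn from the WikiText-2 training split, and num_remove would correspond to 10% or 20% of the model's block count.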
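The two latency scenarios quoted in the Experiment Setup row (prompt processing of a 2048-token input, and generation of 128 tokens at batch size 64) can be timed roughly as in the sketch below. This is not the authors' benchmarking code: it uses CUDA events and random token IDs as dummy inputs, and the warmup/iteration counts and the measure_ms/benchmark names are arbitrary choices made here.

```python
# Rough sketch of the two latency scenarios from the setup row. Assumes a
# CUDA-resident Hugging Face causal LM; warmup and iteration counts are
# arbitrary and not taken from the paper.
import torch


def measure_ms(fn, warmup=3, iters=10):
    """Average wall-clock time of fn() in milliseconds, using CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


@torch.no_grad()
def benchmark(model, vocab_size, device="cuda"):
    model.eval()
    # Prompt processing: one forward pass over a 2048-token input sequence.
    prompt = torch.randint(0, vocab_size, (1, 2048), device=device)
    prompt_ms = measure_ms(lambda: model(input_ids=prompt))
    # Token generation: 128 new tokens at batch size 64, greedy decoding.
    ctx = torch.randint(0, vocab_size, (64, 1), device=device)
    gen_ms = measure_ms(lambda: model.generate(
        ctx, max_new_tokens=128, min_new_tokens=128, do_sample=False))
    return prompt_ms, gen_ms
```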