Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
Authors: Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, Carlo Vittorio Cannistraci
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical experiments show that RIA alone can already surpass all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA ranging from 7B to 65B. Furthermore, N:M semi-structured pruning with channel permutation can even outperform the original LLaMA2-70B on zero-shot tasks, together with practical speedup on specific hardware. Our code is available at: https://github.com/biomedicalcybernetics/Relative-importance-and-activation-pruning |
| Researcher Affiliation | Collaboration | Yingtao Zhang1,2, Haoli Bai4, Haokun Lin5, Jialin Zhao1,2, Lu Hou4, Carlo Vittorio Cannistraci1,2,3. 1Center for Complex Network Intelligence, Tsinghua Laboratory of Brain and Intelligence; 2Department of Computer Science, Tsinghua University; 3Department of Biomedical Engineering, Tsinghua University; 4Huawei Noah's Ark Lab; 5Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/biomedicalcybernetics/Relative-importance-and-activation-pruning |
| Open Datasets | Yes | We use the public checkpoints of the involved models in the Hugging Face Transformers library 1. We utilize 3 NVIDIA A100 GPUs, each equipped with 80GB memory. [...] We employ 128 samples from the C4 dataset (Raffel et al., 2019) for all models, and each sample contains 2048 tokens. |
| Dataset Splits | Yes | We employ 128 samples from the C4 dataset (Raffel et al., 2019) for all models, and each sample contains 2048 tokens. This also aligns with the settings in baseline methods for a fair comparison. Note that we also discuss the choice of calibration data across different datasets, and more details can be found in Appendix C. |
| Hardware Specification | Yes | We utilize 3 NVIDIA A100 GPUs, each equipped with 80GB memory... We conduct tests on the Nvidia Tesla A100, utilizing the cuTLASS and cuSPARSELt library for Sparse Matrix-Matrix Multiplication (SpMM) (Mishra et al., 2021) with N:M sparse matrices. |
| Software Dependencies | No | The paper mentions 'PyTorch' and the specific libraries 'cuTLASS' and 'cuSPARSELt', but it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | For each model under consideration, we apply uniform pruning to all linear layers, with the exception of embeddings and the head. Specifically, each self-attention module has four linear layers, while each MLP module contains three linear layers for LLaMA model families and two for OPT. All the evaluations are conducted with the same code to make sure the comparison is fair. [...] We test each algorithm with 128 calibration data. [...] Batch size of input sequences is 8 and the sequence length is 128. |
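The calibration setting quoted in the Open Datasets and Dataset Splits rows (128 C4 samples of 2048 tokens each) can be reproduced with standard tooling. The sketch below is not the authors' code: it assumes the Hugging Face `datasets` and `transformers` libraries, a single C4 training shard, and a placeholder checkpoint name (`huggyllama/llama-7b`), any of which may differ from the released implementation.

```python
# Hypothetical sketch of the calibration-data pipeline: 128 random C4 excerpts,
# each truncated to 2048 tokens. Shard path and checkpoint name are assumptions.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration(model_name="huggyllama/llama-7b",
                       n_samples=128, seq_len=2048, seed=0):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Load a single training shard of C4 instead of the full corpus.
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    random.seed(seed)
    samples = []
    while len(samples) < n_samples:
        text = data[random.randrange(len(data))]["text"]
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] > seq_len:                     # keep only documents long enough
            start = random.randrange(ids.shape[1] - seq_len)
            samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)                   # shape: (n_samples, seq_len)
```

The 128-sample, 2048-token setting matches the paper's stated calibration setup; the random-shard sampling strategy above is only an assumption.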
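The Experiment Setup row states that pruning is applied uniformly to every linear layer except the embeddings and the head, and the Research Type row mentions N:M semi-structured sparsity. The sketch below illustrates only that pruning scope and the 2:4 mask structure: it uses plain weight magnitude as the saliency score instead of the paper's RIA criterion, and the `lm_head` name check is an assumption about the model's module naming.

```python
# Illustrative 2:4 semi-structured pruning over all nn.Linear layers except the
# output head. Magnitude scoring stands in for RIA (assumption, not the paper's metric).
import torch
import torch.nn as nn

def two_four_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m along the input dimension."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"
    groups = weight.abs().reshape(out_features, in_features // m, m)
    # Zero out the (m - n) smallest entries of each group.
    drop = torch.topk(groups, m - n, dim=-1, largest=False).indices
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop, False)
    return mask.reshape(out_features, in_features)

@torch.no_grad()
def prune_linear_layers(model: nn.Module):
    for name, module in model.named_modules():
        # Embeddings are not nn.Linear, so they are skipped automatically;
        # the LM head is excluded by name (naming convention is an assumption).
        if isinstance(module, nn.Linear) and "lm_head" not in name:
            module.weight.mul_(two_four_mask(module.weight))
```

Note that the paper additionally applies channel permutation before imposing the N:M mask so that more important weights survive the structured constraint; that step is omitted from this sketch.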