Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
Authors: Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, Carlo Vittorio Cannistraci
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical experiments show that RIA alone can already surpass all existing post-training pruning methods on prevalent LLMs, e.g., LLaMA ranging from 7B to 65B. Furthermore, N:M semi-structured pruning with channel permutation can even outperform the original LLaMA2-70B on zero-shot tasks, together with practical speedup on specific hardware. Our code is available at: https://github.com/biomedicalcybernetics/Relative-importance-and-activation-pruning |
| Researcher Affiliation | Collaboration | Yingtao Zhang1,2, Haoli Bai4, Haokun Lin5, Jialin Zhao1,2, Lu Hou4, Carlo Vittorio Cannistraci1,2,3. 1Center for Complex Network Intelligence, Tsinghua Laboratory of Brain and Intelligence; 2Department of Computer Science, Tsinghua University; 3Department of Biomedical Engineering, Tsinghua University; 4Huawei Noah's Ark Lab; 5Institute of Automation, Chinese Academy of Sciences |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Our code is available at: https://github.com/biomedicalcybernetics/Relative-importance-and-activation-pruning |
| Open Datasets | Yes | We use the public checkpoints of the involved models in the Hugging Face Transformers library 1. We utilize 3 NVIDIA A100 GPUs, each equipped with 80GB memory. [...] We employ 128 samples from the C4 dataset (Raffel et al., 2019) for all models, and each sample contains 2048 tokens. |
| Dataset Splits | Yes | We employ 128 samples from the C4 dataset (Raffel et al., 2019) for all models, and each sample contains 2048 tokens. This also aligns with the settings in baseline methods for a fair comparison. Note that we also discuss the choice of calibration data across different datasets, and more details can be found in Appendix C. |
| Hardware Specification | Yes | We utilize 3 NVIDIA A100 GPUs, each equipped with 80GB memory... We conduct tests on the Nvidia Tesla A100, utilizing the cuTLASS and cuSPARSELt library for Sparse Matrix-Matrix Multiplication (SpMM) (Mishra et al., 2021) with N:M sparse matrices. |
| Software Dependencies | No | The paper mentions 'PyTorch' and the specific libraries 'cuTLASS' and 'cuSPARSELt', but it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | For each model under consideration, we apply uniform pruning to all linear layers, with the exception of embeddings and the head. Specifically, each self-attention module has four linear layers, while each MLP module contains three linear layers for LLaMA model families and two for OPT. All the evaluations are conducted with the same code to make sure the comparison is fair. [...] We test each algorithm with 128 calibration data. [...] Batch size of input sequences is 8 and the sequence length is 128. |
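The calibration setting quoted in the Open Datasets and Dataset Splits rows (128 C4 samples of 2048 tokens each) can be reproduced with standard tooling. The sketch below is not the authors' code: it assumes the Hugging Face `datasets` and `transformers` libraries, a single C4 training shard, and a placeholder checkpoint name (`huggyllama/llama-7b`), any of which may differ from the released implementation.

```python
# Hypothetical sketch of the calibration-data pipeline: 128 random C4 excerpts,
# each truncated to 2048 tokens. Shard path and checkpoint name are assumptions.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def get_c4_calibration(model_name="huggyllama/llama-7b",
                       n_samples=128, seq_len=2048, seed=0):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Load a single training shard of C4 instead of the full corpus.
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    random.seed(seed)
    samples = []
    while len(samples) < n_samples:
        text = data[random.randrange(len(data))]["text"]
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] > seq_len:                     # keep only documents long enough
            start = random.randrange(ids.shape[1] - seq_len)
            samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)                   # shape: (n_samples, seq_len)
```

The 128-sample, 2048-token setting matches the paper's stated calibration setup; the random-shard sampling strategy above is only an assumption.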
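The Experiment Setup row states that pruning is applied uniformly to every linear layer except the embeddings and the head, and the Research Type row mentions N:M semi-structured sparsity. The sketch below illustrates only that pruning scope and the 2:4 mask structure: it uses plain weight magnitude as the saliency score instead of the paper's RIA criterion, and the `lm_head` name check is an assumption about the model's module naming.

```python
# Illustrative 2:4 semi-structured pruning over all nn.Linear layers except the
# output head. Magnitude scoring stands in for RIA (assumption, not the paper's metric).
import torch
import torch.nn as nn

def two_four_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m along the input dimension."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by m"
    groups = weight.abs().reshape(out_features, in_features // m, m)
    # Zero out the (m - n) smallest entries of each group.
    drop = torch.topk(groups, m - n, dim=-1, largest=False).indices
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop, False)
    return mask.reshape(out_features, in_features)

@torch.no_grad()
def prune_linear_layers(model: nn.Module):
    for name, module in model.named_modules():
        # Embeddings are not nn.Linear, so they are skipped automatically;
        # the LM head is excluded by name (naming convention is an assumption).
        if isinstance(module, nn.Linear) and "lm_head" not in name:
            module.weight.mul_(two_four_mask(module.weight))
```

Note that the paper additionally applies channel permutation before imposing the N:M mask so that more important weights survive the structured constraint; that step is omitted from this sketch.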