Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
Authors: Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LLaMA-V1/V2, Vicuna, and OPT across various benchmarks demonstrate the effectiveness of DSnoT in enhancing the performance of sparse LLMs, especially at high sparsity levels. |
| Researcher Affiliation | Collaboration | Yuxin Zhang (1,2), Lirui Zhao (1), Mingbao Lin (3), Yunyun Sun (4), Yiwu Yao (4), Xingjia Han (4), Jared Tanner (5), Shiwei Liu (5,6,7), Rongrong Ji (1,8). Affiliations: 1 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; 2 Pengcheng Lab; 3 Tencent Youtu Lab; 4 Huawei Technologies; 5 University of Oxford; 6 University of Texas at Austin; 7 Eindhoven University of Technology; 8 Institute of Artificial Intelligence, Xiamen University |
| Pseudocode | Yes | Algorithm 1: Pseudocode of DSnoT. (A hedged sketch of such a prune-and-grow cycle is given after this table.) |
| Open Source Code | Yes | Codes are available at https://github.com/zyxxmu/DSnoT. |
| Open Datasets | Yes | calibration data consists of 128 segments, each with 2048 tokens. These segments are randomly selected from the first shard of the C4 dataset (Raffel et al., 2020). (See the calibration-loading sketch after this table.) |
| Dataset Splits | Yes | we assess the performance of pruned models by calculating the perplexity of language generation experiments on separate validation sets derived from WikiText-2 (Merity et al., 2016). (See the perplexity sketch after this table.) |
| Hardware Specification | Yes | All pruning experiments are conducted on NVIDIA A100 GPUs with 80GB of memory. |
| Software Dependencies | No | We implement DSnoT in PyTorch (Paszke et al., 2019) and use the Hugging Face Transformers library (Wolf et al., 2019) for handling models and datasets. (No version numbers provided for PyTorch or Hugging Face Transformers.) |
| Experiment Setup | Yes | For the hyper-parameter settings, we set the maximum cycle T = 50 and the update threshold ϵ = 0.1 in all experiments. |
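The Pseudocode row above refers to Algorithm 1 of the paper. As a rough illustration only, here is a minimal PyTorch sketch of a training-free, per-layer prune-and-grow cycle in that spirit, using the reported maximum cycle T = 50 and update threshold ϵ = 0.1. The function name `dsnot_layer_sketch`, the sign-matched growing score, the Wanda-style |w|·‖x‖ pruning score, and the stopping test are all assumptions made for illustration, not the paper's exact criteria.

```python
# Illustrative sketch only: a training-free prune-and-grow cycle in the spirit
# of DSnoT's Algorithm 1. Criteria and stopping rule are assumptions.
import torch

def dsnot_layer_sketch(W_dense, mask, X, T=50, eps=0.1):
    """Adjust a layer's sparsity mask without any weight updates.

    W_dense : (out, in) dense weights of the layer
    mask    : (out, in) float 0/1 mask from an initial pruner (e.g. Wanda)
    X       : (n_tokens, in) calibration activations feeding this layer
    """
    x_mean = X.mean(dim=0)   # expected input per channel
    x_norm = X.norm(dim=0)   # per-channel activation norm (Wanda-style)
    rows = torch.arange(mask.shape[0])
    for _ in range(T):
        # Row-wise reconstruction error between dense and sparse outputs
        # on the average calibration input.
        err = (W_dense - W_dense * mask) @ x_mean
        if err.abs().max() < eps:
            break
        # Grow: revive, per row, the pruned weight whose expected contribution
        # best cancels the current error (sign-matched, largest magnitude).
        contrib = W_dense * x_mean
        grow_score = torch.where(mask.bool(),
                                 torch.full_like(contrib, float("-inf")),
                                 contrib * err.sign().unsqueeze(1))
        grow_idx = grow_score.argmax(dim=1)
        # Prune: drop, per row, the kept weight with the smallest
        # Wanda-style importance |w| * ||x||, keeping sparsity constant.
        imp = torch.where(mask.bool(),
                          W_dense.abs() * x_norm,
                          torch.full_like(W_dense, float("inf")))
        prune_idx = imp.argmin(dim=1)
        mask[rows, grow_idx] = 1.0
        mask[rows, prune_idx] = 0.0
    return mask
```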
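The Open Datasets row quotes the calibration setup (128 random segments of 2048 tokens from the first shard of C4). A sketch of how such a set might be assembled follows; the Hugging Face dataset path `allenai/c4`, the shard file name, and the example model identifier are assumptions based on common practice in similar pruning code bases, not details taken from the paper.

```python
# Hypothetical calibration-data builder: 128 random 2048-token segments
# from the first shard of English C4. Paths and model name are assumptions.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def build_c4_calibration(model_name="meta-llama/Llama-2-7b-hf",
                         n_samples=128, seq_len=2048, seed=0):
    data = load_dataset(
        "allenai/c4",
        data_files={"train": "en/c4-train.00000-of-01024.json.gz"},
        split="train",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    random.seed(seed)
    samples = []
    while len(samples) < n_samples:
        text = data[random.randint(0, len(data) - 1)]["text"]
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.shape[1] <= seq_len:      # skip documents shorter than one segment
            continue
        start = random.randint(0, ids.shape[1] - seq_len - 1)
        samples.append(ids[:, start:start + seq_len])
    return torch.cat(samples, dim=0)     # (n_samples, seq_len)
```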
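For the Dataset Splits row, WikiText-2 perplexity is conventionally computed by concatenating the evaluation text and scoring non-overlapping 2048-token windows. The sketch below follows that convention; the dataset identifier (`wikitext`, `wikitext-2-raw-v1`) and the choice of split are assumptions about the standard setup rather than details stated in the paper.

```python
# Standard-style perplexity evaluation over non-overlapping windows.
# Dataset identifier and split are assumptions about the usual setup.
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext2_perplexity(model, tokenizer, seq_len=2048, device="cuda"):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="validation")
    ids = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids
    n_windows = ids.shape[1] // seq_len
    nlls = []
    for i in range(n_windows):
        batch = ids[:, i * seq_len:(i + 1) * seq_len].to(device)
        # Causal LM loss is the mean per-token negative log-likelihood.
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seq_len)
    return torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len)).item()
```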