DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models
Authors: Shangqian Gao, Chi-Heng Lin, Ting Hua, Zheng Tang, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning. |
| Researcher Affiliation | Collaboration | Shangqian Gao (Florida State University); Chi-Heng Lin, Ting Hua, Zheng Tang, Yilin Shen, Hongxia Jin, Yen-Chang Hsu (Samsung Research America) |
| Pseudocode | Yes | Algorithm 1: Block inference after pruning. |
| Open Source Code | No | Due to the company policy, the code will only be released after going through the internal review process. |
| Open Datasets | Yes | Following previous papers [2, 30], we use WikiText-2 and Alpaca datasets to train the hypernetwork. (See the dataset-loading sketch below the table.) |
| Dataset Splits | No | The paper mentions using WikiText-2 and Alpaca datasets to train the hypernetwork but does not specify explicit training, validation, or test splits with percentages or sample counts for these datasets. |
| Hardware Specification | Yes | Depending on the size of the base model, we use 1 to 4 NVIDIA A100 GPUs to train the hypernetwork. |
| Software Dependencies | No | The paper mentions 'Pytorch [32] and Hugging Face transformer library [41]' but does not specify version numbers for these software components. |
| Experiment Setup | Yes | The hypernetwork is trained for 10,000 iterations for all models. For all experiments, we set λ in Obj. 5 to 6. During training of the hypernetwork, we use the AdamW optimizer with a constant learning rate of 1 × 10⁻³ and weight decay 0.05. We always set the mini-batch size to 1 on each GPU. (See the training-loop sketch below the table.) |
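
The Open Datasets row names WikiText-2 and Alpaca as the corpora used to train the hypernetwork. Below is a minimal loading sketch using the Hugging Face `datasets` library; the Hub identifiers (`wikitext` with config `wikitext-2-raw-v1`, and `tatsu-lab/alpaca`) and the choice of the `train` split are assumptions for illustration, not details taken from the paper.

```python
# Sketch: load the two corpora named in the paper via Hugging Face `datasets`.
# The dataset identifiers and split choice below are assumptions, not the
# authors' preprocessing pipeline.
from datasets import load_dataset

wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
alpaca = load_dataset("tatsu-lab/alpaca", split="train")

print(f"WikiText-2 rows: {len(wikitext2)}, Alpaca rows: {len(alpaca)}")
```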
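
The Experiment Setup row fully specifies the reported optimizer configuration (AdamW, constant learning rate 1e-3, weight decay 0.05, 10,000 iterations, mini-batch size 1 per GPU, λ = 6 in Obj. 5). The sketch below only wires those numbers into a plain PyTorch loop; the hypernetwork module and the loss are stand-in stubs, since the paper's actual hypernetwork architecture and pruning objective are not reproduced here.

```python
# Sketch of the reported hypernetwork training configuration.
# The model and loss are placeholder stubs, not the authors' code.
import torch
import torch.nn as nn

# Stand-in hypernetwork; the real one produces structural pruning decisions.
hypernetwork = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))

# AdamW with the reported constant learning rate and weight decay.
optimizer = torch.optim.AdamW(hypernetwork.parameters(), lr=1e-3, weight_decay=0.05)

for step in range(10_000):                     # 10,000 iterations for all models
    batch = torch.randn(1, 128)                # mini-batch size of 1 per GPU
    loss = hypernetwork(batch).pow(2).mean()   # placeholder for Obj. 5 (lambda = 6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```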