Soft Prompt Recovers Compressed LLMs, Transferably
Authors: Zhaozhuo Xu, Zirui Liu, Beidi Chen, Shaochen Zhong, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, Anshumali Shrivastava
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase that our learnable soft prompt can restore the performance of LLMs under up to 8× compression (joint 4-bit quantization and 50% weight pruning), allowing them to match their uncompressed counterparts on several standard benchmarks. Moreover, these findings are not only valid but also generalizable across various model families, datasets, and tasks, underscoring the broad applicability and impact of our work. Furthermore, we show that compared to other parameter-efficient fine-tuning methods like LoRA (Hu et al., 2021), our approach incurs lower cost in recovering the performance of compressed LLMs. We assess the trade-off using LLaMA (Touvron et al., 2023a) on the C4 dataset (Raffel et al., 2020). Here we adopt two representative post-training compression methods, i.e., GPTQ (Frantar et al., 2022) and SparseGPT (Frantar & Alistarh, 2023), to analyze the trade-off across various compression levels. Figure 3 shows the impact of our approach on the validation set of C4. We observe a significant improvement in PPL across all compression levels. (See the soft-prompt sketch after the table.) |
| Researcher Affiliation | Collaboration | Zhaozhuo Xu* (1), Zirui Liu* (2), Beidi Chen (3), Shaochen (Henry) Zhong (2), Yuxin Tang (2), Jue Wang (4), Kaixiong Zhou (5), Xia Hu (2), Anshumali Shrivastava (2, 6). *Equal contribution. (1) Department of Computer Science, Stevens Institute of Technology; (2) Department of Computer Science, Rice University; (3) Department of Electrical and Computer Engineering, Carnegie Mellon University; (4) Together AI; (5) Department of Electrical and Computer Engineering, North Carolina State University; (6) ThirdAI Corp. |
| Pseudocode | No | The paper describes its method verbally and with mathematical formulations, but it contains no explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/zirui-ray-liu/compress-then-prompt. |
| Open Datasets | Yes | We use Common Crawl's web corpus (C4) (Raffel et al., 2020), Wikitext-2 (Merity et al., 2017), and the Penn Treebank (PTB) (Marcus et al., 1994) as language generation datasets. |
| Dataset Splits | Yes | We would like to note that the post-training compression is conducted using the training set of C4, and subsequently, we evaluate the performance of the compression with the validation set of C4. Figure 3 shows the impact of our approach on the validation set of C4. On the C4 training set, we compress the OPT-1.3B, OPT-2.7B, OPT-6.7B, and LLaMA-7B using SparseGPT (Frantar & Alistarh, 2023). (See the perplexity-evaluation sketch after the table.) |
| Hardware Specification | Yes | We use Nvidia RTX 8000 (48G) GPUs to conduct inference and prompt learning in LLMs. All experiments are conducted on a server with eight Nvidia RTX 8000 (48G) GPUs, 1.5T main memory, and two AMD EPYC 7742 64-Core Processors. |
| Software Dependencies | Yes | The software and package versions are specified in Table 5 (package configurations of our experiments): CUDA 11.6; pytorch 2.0.1; transformers 4.30.0.dev0; accelerate 0.18.0. |
| Experiment Setup | Yes | In the experiments, we employed the AdamW (Loshchilov & Hutter, 2019) optimizer. We conducted iterative prompt updates using a batch size of 4, a weight decay of 10⁻⁵, and a learning rate of 10⁻³. We set the total number of optimization steps to 30,000 and use the model corresponding to the best validation perplexity as the final model. (See the training-loop sketch after the table.) |
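
The sketch below illustrates the soft-prompt recovery idea quoted in the Research Type row: a small learnable prompt is prepended to the input embeddings of a frozen, already-compressed LLM, and only the prompt is trained. It assumes the PyTorch and `transformers` versions from Table 5; the model name `facebook/opt-1.3b` stands in for a GPTQ/SparseGPT-compressed checkpoint, and the prompt length is illustrative rather than taken from the paper.

```python
# Minimal sketch: a learnable soft prompt prepended to a frozen (compressed) LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # stand-in for a GPTQ/SparseGPT-compressed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)        # compressed weights stay frozen; only the prompt learns

hidden_size = model.get_input_embeddings().embedding_dim
num_prompt_tokens = 100            # assumed prompt length, not taken from the paper
soft_prompt = torch.nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

def forward_with_prompt(input_ids):
    """Prepend the soft prompt and compute the LM loss on the real tokens only."""
    embeds = model.get_input_embeddings()(input_ids)                 # (B, T, H)
    prompt = soft_prompt.unsqueeze(0).expand(embeds.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, embeds], dim=1)               # (B, P+T, H)
    ignore = torch.full((embeds.size(0), num_prompt_tokens), -100, dtype=torch.long)
    labels = torch.cat([ignore, input_ids], dim=1)                   # mask prompt positions
    return model(inputs_embeds=inputs_embeds, labels=labels).loss
```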
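
The Dataset Splits row states that compression uses the C4 training set while evaluation uses the C4 validation set. The sketch below computes validation perplexity under that protocol; loading C4 through the Hugging Face `datasets` library (`allenai/c4`), the number of validation documents, and the sequence length are assumptions, and it reuses `tokenizer` and `forward_with_prompt` from the previous sketch.

```python
# Sketch: perplexity of the prompted, compressed model on the C4 validation split.
import math
import torch
from datasets import load_dataset  # assumed loader; not listed in Table 5

val_stream = load_dataset("allenai/c4", "en", split="validation", streaming=True)

@torch.no_grad()
def c4_validation_ppl(num_docs=256, max_len=512):
    total_loss, num_batches = 0.0, 0
    for i, doc in enumerate(val_stream):
        if i >= num_docs:
            break
        ids = tokenizer(doc["text"], return_tensors="pt",
                        truncation=True, max_length=max_len).input_ids
        if ids.size(1) < 2:
            continue
        total_loss += forward_with_prompt(ids).item()
        num_batches += 1
    return math.exp(total_loss / max(num_batches, 1))
```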
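
Finally, a sketch of the optimization setup from the Experiment Setup row: AdamW over the soft prompt only, batch size 4, weight decay 10⁻⁵, learning rate 10⁻³, 30,000 steps, keeping the prompt that achieves the best validation perplexity. The data helper `get_c4_train_batch` and the evaluation interval are hypothetical; `soft_prompt`, `forward_with_prompt`, and `c4_validation_ppl` come from the sketches above.

```python
# Sketch: prompt-only optimization with the hyperparameters quoted above.
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3, weight_decay=1e-5)

best_ppl, best_prompt = float("inf"), soft_prompt.detach().clone()
for step in range(30_000):
    input_ids = get_c4_train_batch(batch_size=4)   # hypothetical helper yielding tokenized C4 text
    loss = forward_with_prompt(input_ids)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) % 1000 == 0:                     # evaluation interval is an assumption
        ppl = c4_validation_ppl()
        if ppl < best_ppl:
            best_ppl, best_prompt = ppl, soft_prompt.detach().clone()
```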