Soft Prompt Recovers Compressed LLMs, Transferably

Authors: Zhaozhuo Xu, Zirui Liu, Beidi Chen, Shaochen Zhong, Yuxin Tang, Jue Wang, Kaixiong Zhou, Xia Hu, Anshumali Shrivastava

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase that our learnable soft prompt can restore the performance of LLMs with up to 8x compression (joint 4-bit quantization and 50% weight pruning), allowing them to match their uncompressed counterparts on several standard benchmarks. Moreover, these findings are not only valid but also generalizable across various model families, datasets, and tasks, underscoring the broad applicability and impact of our work. Furthermore, we show that compared to other parameter-efficient fine-tuning methods like LoRA (Hu et al., 2021), our approach has less cost in recovering the performance of compressed LLMs. We assess the trade-off using LLaMA (Touvron et al., 2023a) on the C4 dataset (Raffel et al., 2020). Here we adopt two representative post-training compression methods, i.e., GPTQ (Frantar et al., 2022) and SparseGPT (Frantar & Alistarh, 2023), to analyze the trade-off across various compression levels. Figure 3 shows the impact of our approach on the validation set of C4. We observe a significant improvement in PPL across all compression levels. (A hedged sketch of this soft-prompt setup appears after the table.)
Researcher Affiliation | Collaboration | Zhaozhuo Xu*1, Zirui Liu*2, Beidi Chen3, Shaochen (Henry) Zhong2, Yuxin Tang2, Jue Wang4, Kaixiong Zhou5, Xia Hu2, Anshumali Shrivastava2,6 (*equal contribution). 1: Department of Computer Science, Stevens Institute of Technology; 2: Department of Computer Science, Rice University; 3: Department of Electrical and Computer Engineering, Carnegie Mellon University; 4: Together AI; 5: Department of Electrical and Computer Engineering, North Carolina State University; 6: Third AI Corp.
Pseudocode | No | No pseudocode or algorithm blocks are present. The paper describes methods verbally and with mathematical formulations but does not include any explicitly labeled or structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/zirui-ray-liu/compress-then-prompt.
Open Datasets | Yes | We use Common Crawl's web corpus (C4) (Raffel et al., 2020), Wikitext-2 (Merity et al., 2017), and the Penn Treebank (PTB) (Marcus et al., 1994) databases as language generation datasets.
Dataset Splits | Yes | We would like to note that the post-training compression is conducted using the training set of C4, and subsequently, we evaluate the performance of the compression with the validation set of C4. Figure 3 shows the impact of our approach on the validation set of C4. On the C4 training set, we compress the OPT-1.3B, OPT-2.7B, OPT-6.7B, and LLaMA-7B using SparseGPT (Frantar & Alistarh, 2023). (Loading these splits is sketched after the table.)
Hardware Specification | Yes | We use Nvidia RTX 8000 (48G) GPUs to conduct inference and prompt learning in LLMs. All experiments are conducted on a server with eight Nvidia RTX 8000 (48G) GPUs, 1.5T main memory, and two AMD EPYC 7742 64-Core Processors.
Software Dependencies | Yes | The software and package versions are specified in Table 5 (Package configurations of our experiments): CUDA 11.6, pytorch 2.0.1, transformers 4.30.0.dev0, accelerate 0.18.0. (A version-check sketch follows the table.)
Experiment Setup | Yes | In the experiment, we employed the AdamW (Loshchilov & Hutter, 2019) optimizer as our chosen optimizer. We conducted iterative prompt updates using a batch size of 4, a weight decay of 10^-5, and a learning rate of 10^-3. We set the total optimization steps as 30,000 and use the model corresponding to the best validation perplexity as the final model. (This setup is wired into the training-loop sketch after the table.)
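
The Research Type row describes the core technique: a small set of learnable soft-prompt embeddings is prepended to the input of a frozen, compressed LLM and trained so that the compressed model approaches its uncompressed counterpart. The sketch below is a minimal illustration of that idea, not the authors' implementation: the model name, prompt length, and initialization scale are assumptions, and the weights are only frozen here rather than actually compressed with GPTQ or SparseGPT.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"   # assumed stand-in; the paper compresses OPT and LLaMA checkpoints
num_prompt_tokens = 10             # assumed soft-prompt length

tokenizer = AutoTokenizer.from_pretrained(model_name)   # used to build input_ids elsewhere
model = AutoModelForCausalLM.from_pretrained(model_name)

# In the paper the weights would first be compressed (GPTQ / SparseGPT);
# here they are simply frozen so that only the soft prompt is trainable.
for p in model.parameters():
    p.requires_grad = False

embed = model.get_input_embeddings()
hidden_size = embed.weight.shape[1]
# The learnable soft prompt: the only trainable parameters.
soft_prompt = torch.nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)

def forward_with_prompt(input_ids, labels):
    """Prepend the soft prompt to the token embeddings and return the LM loss."""
    tok_emb = embed(input_ids)                                   # (B, T, H)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, tok_emb], dim=1)          # (B, P+T, H)
    # Ignore the loss on the prompt positions (-100 is the ignore index).
    ignore = torch.full((input_ids.size(0), num_prompt_tokens), -100,
                        dtype=labels.dtype, device=labels.device)
    return model(inputs_embeds=inputs_embeds,
                 labels=torch.cat([ignore, labels], dim=1)).loss
```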
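The Dataset Splits row states that the C4 training set is used for compression and prompt learning, while perplexity is reported on the C4 validation set. The snippet below is one way to obtain those splits with Hugging Face `datasets`; the `allenai/c4` dataset name, the streaming mode, and the 128-sample calibration size are assumptions rather than details taken from the paper.

```python
from datasets import load_dataset

# C4 training split: used for compression calibration and prompt learning.
train_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
# C4 validation split: the paper reports perplexity on this split.
val_stream = load_dataset("allenai/c4", "en", split="validation", streaming=True)

# Take a small calibration sample from the training stream (sample size is an assumption).
calibration_texts = [row["text"] for row, _ in zip(train_stream, range(128))]
```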
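The Software Dependencies row pins CUDA 11.6, pytorch 2.0.1, transformers 4.30.0.dev0, and accelerate 0.18.0. A small, hypothetical check like the following can confirm that a local environment matches those pins before attempting a reproduction.

```python
import torch
import transformers
import accelerate

# Versions from Table 5 of the paper.
expected = {"torch": "2.0.1", "transformers": "4.30.0.dev0", "accelerate": "0.18.0"}
found = {
    "torch": torch.__version__,
    "transformers": transformers.__version__,
    "accelerate": accelerate.__version__,
}
for pkg, want in expected.items():
    base = want.split(".dev")[0]  # compare on the release part of dev versions
    status = "ok" if found[pkg].startswith(base) else "mismatch"
    print(f"{pkg}: found {found[pkg]}, expected {want} -> {status}")

# The paper reports CUDA 11.6; torch exposes the CUDA version it was built against.
print("CUDA (torch build):", torch.version.cuda)
```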
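The Experiment Setup row specifies AdamW, a batch size of 4, a weight decay of 10^-5, a learning rate of 10^-3, 30,000 optimization steps, and selection of the prompt with the best validation perplexity. The sketch below wires those numbers into a plain PyTorch loop; `soft_prompt` and `forward_with_prompt` come from the first sketch, while `train_loader`, `val_loader` (assumed to yield `(input_ids, labels)` batches of size 4 from C4), and the evaluation interval are assumptions.

```python
import math
import torch

# Only the soft prompt is optimized; the compressed model stays frozen.
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3, weight_decay=1e-5)

best_val_ppl = float("inf")
best_prompt = soft_prompt.detach().clone()

for step, (input_ids, labels) in zip(range(30_000), train_loader):  # 30,000 steps, batch size 4
    loss = forward_with_prompt(input_ids, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 1_000 == 0:  # evaluation interval is an assumption
        with torch.no_grad():
            val_loss = sum(forward_with_prompt(x, y).item()
                           for x, y in val_loader) / len(val_loader)
        val_ppl = math.exp(val_loss)
        if val_ppl < best_val_ppl:  # keep the prompt with the best validation perplexity
            best_val_ppl = val_ppl
            best_prompt = soft_prompt.detach().clone()
```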