Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Authors: Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R. Bartoldson, Ajay Kumar Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This study conducts the first, thorough evaluation of three (3) leading LLMs using five (5) SoTA compression techniques across eight (8) trustworthiness dimensions. Our experiments highlight the intricate interplay between compression and trustworthiness, revealing some interesting patterns. |
| Researcher Affiliation | Academia | University of Texas at Austin; Drexel University; MIT; UIUC; Duke University; Lawrence Livermore National Laboratory; Center for AI Safety; University of California, Berkeley; University of Chicago. |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Model & Code: https://decoding-comp-trust.github.io. WARNING: This paper contains model outputs that may be considered offensive. To benefit the reproducibility of our experiments, we release all models tested in the benchmark and the modified DecodingTrust benchmark to mitigate the large score variances caused by the large refusal rates. The links can be found on our website. |
| Open Datasets | Yes | LLAMA2 13b is an LLM pre-trained on 2 trillion tokens of publicly available data in an auto-regressive manner (Touvron et al., 2023b). Calibration sets are drawn from the C4 dataset (Raffel et al., 2019). Evaluation uses Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020), DecodingTrust (Wang et al., 2023a), and a pre-processed Enron Mail dataset. |
| Dataset Splits | No | The paper mentions using 'calibration data' and 'randomly sampled calibration sets' for compression methods, and discusses pre-existing benchmarks like MMLU and Decoding Trust. However, it does not explicitly provide specific training/validation/test dataset split percentages or counts for its experiments. |
| Hardware Specification | Yes | For example, compressing a 13b model to 4 bits takes merely half an hour on a 48GB A40 GPU and results in an average speedup of 3.2-3.3x in inference speed, as demonstrated by AWQ compared to Huggingface's FP16 implementation (Lin et al., 2023). (A hedged AWQ quantization sketch follows the table.) |
| Software Dependencies | No | For pruning, we use the pruning library from wanda. For quantization, we used AutoGPTQ and AWQ. Commands to reproduce models are included on our website. |
| Experiment Setup | Yes | We use the 13b models as a baseline to scrutinize the compressed trust and compare with 7b and 7b-sized compressed models. The 7b-sized models are compressed from the 13b LLMs, LLAMA2 Chat, LLAMA2, and Vicuna, by two quantization and three pruning methods. As SparseGPT with 50% sparsity is sensitive to the calibration set, we repeat the experiments with three randomly sampled calibration sets from the C4 dataset (Raffel et al., 2019) and report the average. To answer these questions, we extend the LLAMA2 13b Chat experiments to 3 and 4 bits using GPTQ and AWQ. For 3-bit and 4-bit, we repeat the experiments three times with randomly subsampled calibration sets. (A hedged calibration-sampling sketch follows the table.) |
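
The Hardware Specification row quotes the paper's claim that a 13b model can be quantized to 4 bits in about half an hour on a 48GB A40. The snippet below is a minimal sketch of that kind of run using the AutoAWQ package rather than the authors' own AWQ scripts (their exact commands are listed on the project website); the model path, output directory, and quantization config values are illustrative assumptions, not settings confirmed by the paper.

```python
# Hedged sketch: 4-bit AWQ quantization of a LLAMA2 13b chat model with AutoAWQ.
# Paths and quant_config values are assumptions for illustration only.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-13b-chat-hf"   # assumed checkpoint name
quant_path = "llama-2-13b-chat-awq-w4"          # assumed output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run activation-aware weight quantization, then save the 4-bit checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The saved 4-bit checkpoint is what the paper compares against Huggingface's FP16 inference when reporting the 3.2-3.3x speedup.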
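
The Experiment Setup row states that the SparseGPT and 3-/4-bit GPTQ runs were repeated with randomly subsampled calibration sets from C4. The sketch below shows one way to draw such a calibration set and pass it to the auto-gptq library; the seed handling, sample count, sequence length, and model/output names are assumptions for illustration, not the authors' exact configuration.

```python
# Hedged sketch: drawing a random C4 calibration subset (one seed per repeat)
# and running 4-bit GPTQ quantization with the auto-gptq library.
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint name
seed, n_samples, seq_len = 0, 128, 2048      # assumed calibration settings

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Stream C4 with a seed-dependent shuffle and keep the first n_samples
# documents that are long enough to fill seq_len tokens.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True).shuffle(seed=seed)
examples = []
for sample in c4:
    enc = tokenizer(sample["text"], return_tensors="pt")
    if enc.input_ids.shape[1] >= seq_len:
        examples.append({
            "input_ids": enc.input_ids[:, :seq_len],
            "attention_mask": enc.attention_mask[:, :seq_len],
        })
    if len(examples) == n_samples:
        break

# Quantize to 4 bits on the sampled calibration set and save the result.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized(f"llama-2-13b-chat-gptq-w4-seed{seed}")
```

Repeating this with three different seeds and averaging the benchmark scores, as the paper describes, separates genuine compression effects from calibration-set variance.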