Accuracy is Not All You Need
Authors: Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, Ramachandran Ramjee
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models both qualitatively and quantitatively using MT-Bench and show that compressed models exhibiting high flips are worse than baseline models in this free-form generative task. (See the flips sketch after the table.) |
| Researcher Affiliation | Industry | Abhinav Dutta (Microsoft Research, Bangalore, India; t-abdutta@microsoft.com); Sanjeev Krishnan (Microsoft Research, Bangalore, India; sakrishnan@microsoft.com); Nipun Kwatra (Microsoft Research, Bangalore, India; nipun.kwatra@microsoft.com); Ramachandran Ramjee (Microsoft Research, Bangalore, India; ramjee@microsoft.com) |
| Pseudocode | No | The paper describes concepts and evaluations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | Our work uses existing open sourced code and does not need any private code for reproduction. |
| Open Datasets | Yes | MMLU (Hendrycks et al., 2021a), Hellaswag (Zellers et al., 2019), ARC (Clark et al., 2018), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2019), Winogrande (Sakaguchi et al., 2019)... GSM8k (Cobbe et al., 2021b), Trivia QA (Joshi et al., 2017), MT-Bench (Zheng et al., 2023)... MATH (Hendrycks et al., 2021b), BFCL (Yan et al., 2024), and Scrolls-Quality (Shaham et al., 2022). |
| Dataset Splits | No | The paper mentions few-shot evaluation (e.g., MMLU 5-shot), but does not explicitly provide training/validation splits for the datasets it uses for evaluation. |
| Hardware Specification | No | The actual type of compute resources used is irrelevant to the evaluations (not accounting for floating point errors) |
| Software Dependencies | Yes | LLM.int8() (Dettmers et al., 2022) as implemented in Bitsandbytes (Dettmers, 2024)... We used GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024)... We used Smoothquant (Xiao et al., 2024)... We use Tensor RT (NVIDIA, 2024) for Smooth Quant, all other schemes were evaluated using Hugging Face Transformers (Wolf et al., 2020)... we have used GPT-4 (OpenAI et al., 2024) (v0314) as judge. |
| Experiment Setup | Yes | We used GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024) with group-size 128 with other parameters being default. We used Smoothquant (Xiao et al., 2024) (referred to as SQ W8A8) with per-token, per-channel quantization using α = 0.5. ...results on all models use greedy decoding, making these results fully deterministic. (See the quantization sketch after the table.) |
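
The paper's central observation is that a compressed model can match baseline accuracy while still answering many individual questions differently; it measures this with "flips", the share of examples whose correct/incorrect status changes between the baseline and the compressed model. The snippet below is a minimal sketch of that idea, not the authors' code; the function and variable names are illustrative.

```python
# Minimal sketch of the "flips" metric described in the paper: the percentage of
# benchmark questions whose correct/incorrect status differs between the baseline
# model and the compressed model, in either direction. Not the authors' code.

def flips_percentage(baseline_correct: list[bool], compressed_correct: list[bool]) -> float:
    """Percentage of examples whose correctness label differs between the two models."""
    assert len(baseline_correct) == len(compressed_correct), "per-example results must align"
    n_flips = sum(b != c for b, c in zip(baseline_correct, compressed_correct))
    return 100.0 * n_flips / len(baseline_correct)

# Identical accuracy (3/5 correct for both models) can still hide a 40% flip rate.
baseline   = [True, True, True, False, False]
compressed = [True, False, True, True, False]
print(flips_percentage(baseline, compressed))  # 40.0
```

The toy example above makes the paper's point concrete: both models score 60% accuracy, yet 40% of the answers changed, which is exactly the behavior that accuracy-only comparisons miss.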
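As a rough guide to the "Software Dependencies" and "Experiment Setup" rows, the sketch below shows how LLM.int8() and GPTQ with group size 128 can be configured through Hugging Face Transformers, with greedy decoding for determinism. It is an assumption-laden illustration rather than the authors' evaluation harness: the model name, calibration dataset, and GPTQ bit width are placeholders, and the SmoothQuant/TensorRT path is not shown.

```python
# Hedged sketch of the quantization settings reported in the paper, mapped onto the
# Hugging Face Transformers API. Model name and calibration dataset are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper evaluates several models
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LLM.int8() as implemented in bitsandbytes.
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# GPTQ with group size 128, other parameters left at their defaults
# (quantizing at load time requires the optimum and auto-gptq packages).
gptq_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer),
    device_map="auto",
)

# Greedy decoding (do_sample=False) keeps generation deterministic, as the paper notes.
inputs = tokenizer("Q: What is 2 + 2?\nA:", return_tensors="pt").to(int8_model.device)
out = int8_model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Running per-example evaluations of the baseline and each quantized model with this kind of setup yields the correctness vectors that the flips sketch above consumes.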