Accuracy is Not All You Need

Authors: Abhinav Dutta, Sanjeev Krishnan, Nipun Kwatra, Ramachandran Ramjee

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a detailed study of metrics across multiple compression techniques, models and datasets, demonstrating that the behavior of compressed models as visible to end-users is often significantly different from the baseline model, even when accuracy is similar. We further evaluate compressed models both qualitatively and quantitatively using MT-Bench and show that compressed models exhibiting high flips are worse than baseline models in this free-form generative task.
Researcher Affiliation | Industry | Abhinav Dutta, Microsoft Research, Bangalore, India (t-abdutta@microsoft.com); Sanjeev Krishnan, Microsoft Research, Bangalore, India (sakrishnan@microsoft.com); Nipun Kwatra, Microsoft Research, Bangalore, India (nipun.kwatra@microsoft.com); Ramachandran Ramjee, Microsoft Research, Bangalore, India (ramjee@microsoft.com)
Pseudocode | No | The paper describes concepts and evaluations but does not include any pseudocode or algorithm blocks.
Open Source Code | No | Our work uses existing open-sourced code and does not need any private code for reproduction.
Open Datasets | Yes | MMLU (Hendrycks et al., 2021a), Hellaswag (Zellers et al., 2019), ARC (Clark et al., 2018), LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2019), Winogrande (Sakaguchi et al., 2019)... GSM8k (Cobbe et al., 2021b), TriviaQA (Joshi et al., 2017), MT-Bench (Zheng et al., 2023)... MATH (Hendrycks et al., 2021b), BFCL (Yan et al., 2024), and Scrolls-Quality (Shaham et al., 2022).
Dataset Splits | No | The paper mentions few-shot evaluation (e.g., MMLU 5-shot) but does not explicitly provide training/validation splits for the datasets it uses for evaluation.
Hardware Specification | No | The actual type of compute resources used is irrelevant to the evaluations (up to floating-point nondeterminism).
Software Dependencies | Yes | LLM.int8() (Dettmers et al., 2022) as implemented in bitsandbytes (Dettmers, 2024)... We used GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024)... We used SmoothQuant (Xiao et al., 2024)... We use TensorRT (NVIDIA, 2024) for SmoothQuant; all other schemes were evaluated using Hugging Face Transformers (Wolf et al., 2020)... we have used GPT-4 (OpenAI et al., 2024) (v0314) as judge.
Experiment Setup | Yes | We used GPTQ (Frantar et al., 2023), AWQ (Lin et al., 2024) with group-size 128, with other parameters being default. We used SmoothQuant (Xiao et al., 2024) (referred to as SQ W8A8) with per-token, per-channel quantization using α = 0.5. ... results on all models use greedy decoding, making these results fully deterministic.
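The "flips" phenomenon at the heart of the study — individual answers changing between the baseline and compressed model even when aggregate accuracy is nearly unchanged — can be illustrated with a toy sketch. The answer lists below are invented for illustration, not the paper's data:

```python
def flip_rate(baseline_answers, compressed_answers):
    """Fraction of questions whose predicted answer changed after compression."""
    assert len(baseline_answers) == len(compressed_answers)
    flips = sum(b != c for b, c in zip(baseline_answers, compressed_answers))
    return flips / len(baseline_answers)

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical 10-question multiple-choice benchmark: both models score
# 60% accuracy, yet 4 of 10 answers differ between them.
gold       = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
baseline   = ["A", "B", "C", "D", "A", "B", "A", "A", "B", "C"]
compressed = ["A", "B", "C", "D", "B", "A", "C", "D", "B", "C"]

print(accuracy(baseline, gold))            # 0.6
print(accuracy(compressed, gold))          # 0.6
print(flip_rate(baseline, compressed))     # 0.4
```

This is why the report emphasizes flips alongside accuracy: compression can redistribute which questions a model gets right while leaving the headline number intact.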
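The SmoothQuant configuration in the last row (per-token, per-channel quantization with α = 0.5) rests on the scale-migration identity from Xiao et al.: each input channel j is smoothed by s_j = max|X_j|^α / max|W_j|^(1−α), which leaves the layer output mathematically unchanged while shrinking activation outliers. A toy NumPy sketch of that identity (not the TensorRT implementation the paper actually used; shapes and data are illustrative):

```python
import numpy as np

def smoothquant_scales(X, W, alpha=0.5):
    """Per-channel smoothing scales s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    X: activations, shape (tokens, channels); W: weights, shape (channels, out).
    """
    act_max = np.abs(X).max(axis=0)   # per-input-channel activation range
    w_max = np.abs(W).max(axis=1)     # per-input-channel weight range
    return act_max ** alpha / w_max ** (1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50                         # simulate one outlier activation channel
W = rng.normal(size=(8, 16))

s = smoothquant_scales(X, W, alpha=0.5)
# (X / s) @ diag(s) W equals X @ W exactly, but the scaled activations
# have a much flatter per-channel range, so they quantize with less error.
Y_smooth = (X / s) @ (W * s[:, None])
assert np.allclose(Y_smooth, X @ W)
```

The quantization error reduction then comes from rounding (X / s) and (diag(s) W) to int8 instead of the raw tensors; the sketch only shows the exact-equivalence step that makes the migration safe.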