Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time

Authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, Beidi Chen

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate that DEJAVU can reduce the inference latency of OPT-175B by over 2× compared to the state-of-the-art FasterTransformer, and over 6× compared to the widely used Hugging Face implementation, without compromising model quality. The code is available at https://github.com/FMInference/DejaVu.
Researcher Affiliation | Collaboration | 1 Rice University; 2 Zhejiang University; 3 Stanford University; 4 University of California, San Diego; 5 ETH Zurich; 6 Adobe Research; 7 Meta AI (FAIR); 8 Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1 (Sparse Predictor Training). Input: a pre-trained LLM block with parameter set M, token embedding set at the block {x_i}, i ∈ [N], threshold t. Output: sparse predictor SP.
    P+ ← ∅, P− ← ∅
    for i = 1 to N do
        P+ ← P+ ∪ {(x_i, m_r) | m_r ∈ M, m_r(x_i) ≥ t}
        P− ← P− ∪ {(x_i, m_r) | m_r ∈ M, m_r(x_i) < t}
    end for
    SP ← TRAIN(P+, P−, L)    (L is a loss function)
    (A hedged Python sketch of this procedure is included after the table.)
Open Source Code | Yes | The code is available at https://github.com/FMInference/DejaVu.
Open Datasets | Yes | We compare the accuracy of DEJAVU-OPT against the original OPT model on two language modeling datasets, WikiText (Merity et al., 2016) and C4 (Raffel et al., 2019), and seven few-shot downstream tasks: CB (de Marneffe et al., 2019), COPA (Gordon et al., 2012), Lambada (Radford et al., 2019), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), RTE (Giampiccolo et al., 2007), Winogrande (ai2, 2019). We use lm-eval-harness (Gao et al., 2021) for zero-shot and five-shot tasks. We collect training data for the sparsity predictor using 500 random data points from the C4 training dataset.
Dataset Splits | No | The paper mentions using a "C4 training dataset" and evaluating on zero-shot and five-shot tasks, which implies the use of validation/test sets. However, it does not give explicit percentages or counts for training, validation, and test splits for the main model evaluation, beyond specifying that 500 random data points from the C4 training set were used as training data for the sparsity predictor.
Hardware Specification | Yes | Taking OPT-175B as an example, the latency of one MLP block is only 0.2 ms on an 8×A100 80GB machine. Our experiments are conducted on NVIDIA A100 80GB GPU servers.
Software Dependencies | No | The paper mentions software components such as "DEJAVU (written mostly in Python)", "Triton (Tillet et al., 2019)", "PyTorch", "FasterTransformer", "Hugging Face", and "lm-eval-harness (Gao et al., 2021)". However, it does not provide specific version numbers for any of these software dependencies, which would be necessary for full reproducibility.
Experiment Setup | Yes | We collect training data for the sparsity predictor using 500 random data points from the C4 training dataset. Our experiments are conducted on NVIDIA A100 80GB GPU servers. We compare the accuracy of DEJAVU-OPT against the original OPT model on two language modeling datasets, WikiText (Merity et al., 2016) and C4 (Raffel et al., 2019), and seven few-shot downstream tasks: CB (de Marneffe et al., 2019), COPA (Gordon et al., 2012), Lambada (Radford et al., 2019), OpenBookQA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), RTE (Giampiccolo et al., 2007), Winogrande (ai2, 2019). We use lm-eval-harness (Gao et al., 2021) for zero-shot and five-shot tasks. Figure 7 presents the latency speed-up for token generation with OPT-175B at batch size 1. At around 75% sparsity, DEJAVU speeds up generation. For DEJAVU-OPT-175B, we set the entire model sparsity at 75%. For quantization, we apply 4-bit quantization on model weights (W4A16). (A sketch of a contextually sparse MLP forward pass at this sparsity level also follows the table.)
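
To make the quoted Algorithm 1 concrete, the following is a minimal Python/PyTorch sketch of the training-data collection and predictor training it describes. It is not the authors' implementation: the function names, the reading of "m_r(x_i) ≥ t" as an activation-magnitude threshold on MLP neurons, the two-layer predictor shape, and the choice of binary cross-entropy as the loss L are all illustrative assumptions.

```python
# Minimal sketch of Algorithm 1 (Sparse Predictor Training) -- not the authors' code.
# Assumption: "m_r(x_i) >= t" is read as "neuron r's activation magnitude on token
# embedding x_i crosses threshold t"; P+ / P- become per-neuron binary labels.
import torch
import torch.nn as nn

def collect_predictor_data(mlp_up: nn.Linear, embeddings: torch.Tensor, t: float):
    """Label each (token, neuron) pair as active (>= t, set P+) or inactive (< t, set P-)."""
    with torch.no_grad():
        acts = torch.relu(mlp_up(embeddings))      # [N, num_neurons]
    labels = (acts.abs() >= t).float()             # 1.0 -> P+, 0.0 -> P-
    return embeddings, labels

def train_sparse_predictor(x: torch.Tensor, labels: torch.Tensor,
                           hidden: int = 1024, epochs: int = 10) -> nn.Module:
    """Train a small two-layer predictor SP that scores every neuron for a given token."""
    d, num_neurons = x.shape[1], labels.shape[1]
    sp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, num_neurons))
    opt = torch.optim.Adam(sp.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()               # stands in for the loss L in Algorithm 1
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(sp(x), labels)
        loss.backward()
        opt.step()
    return sp
```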
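
For the 75% sparsity setting quoted in the Experiment Setup row, the sketch below illustrates how such a trained predictor could gate an MLP block at inference: score the neurons for the current token, keep only the top 25%, and gather just those rows/columns of the up- and down-projection weights. This is an assumption-level illustration of the contextual-sparsity idea, not the paper's fused Triton/CUDA kernels.

```python
# Sketch of a contextually sparse MLP forward pass at ~75% sparsity (batch size 1).
# Assumption: top-k selection over predictor scores approximates the neuron subset
# the paper's predictor chooses; the released kernels fuse this very differently.
import torch
import torch.nn as nn

def sparse_mlp_forward(x: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor,
                       predictor: nn.Module, sparsity: float = 0.75) -> torch.Tensor:
    """x: [1, d]; w_up: [num_neurons, d]; w_down: [d, num_neurons]."""
    num_neurons = w_up.shape[0]
    k = max(1, int(num_neurons * (1.0 - sparsity)))    # neurons kept: 25% at 75% sparsity
    with torch.no_grad():
        scores = predictor(x)                          # [1, num_neurons]
        idx = scores.topk(k, dim=-1).indices.squeeze(0)
    w_up_k = w_up.index_select(0, idx)                 # [k, d]   gather only needed weights
    w_down_k = w_down.index_select(1, idx)             # [d, k]
    h = torch.relu(x @ w_up_k.t())                     # [1, k]
    return h @ w_down_k.t()                            # [1, d]
```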