Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Authors: Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We retrofit pretrained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7× throughput increase during auto-regressive inference on an NVIDIA H100 GPU. ... We evaluate our DMC models on a series of downstream tasks such as MMLU (Hendrycks et al., 2021) for factuality, QA datasets for common-sense reasoning, and HumanEval (Chen et al., 2021) for code. We find that DMC LLMs retain a downstream performance similar to the original LLM, whereas baselines such as GQA, H2O, and TOVA incur significant degradation at high compression ratios. |
| Researcher Affiliation | Collaboration | NVIDIA, University of Wrocław, University of Edinburgh |
| Pseudocode | Yes | Algorithm 1: Single-head KV cache update with Dynamic Memory Compression (DMC). A hedged sketch of this update appears after the table. |
| Open Source Code | Yes | We release the DMC code and models at https://github.com/NVIDIA/Megatron-LM/tree/DMC. |
| Open Datasets | Yes | We evaluate models on a series of downstream tasks, including MMLU (Hendrycks et al., 2021) for factuality, HumanEval (Chen et al., 2021) for Python code generation, and several question-answering datasets for common-sense reasoning: PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), Arc-C and Arc-E (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and WinoGrande (Sakaguchi et al., 2020). |
| Dataset Splits | Yes | We evaluate models on a series of downstream tasks, including MMLU (Hendrycks et al., 2021) for factuality, HumanEval (Chen et al., 2021) for Python code generation, and several question-answering datasets for common-sense reasoning: PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), Arc-C and Arc-E (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and WinoGrande (Sakaguchi et al., 2020). We report the 5-shot performance on MMLU, average pass@1 scores for HumanEval, and average 0-shot performance on common-sense benchmarks (CS-QA). |
| Hardware Specification | Yes | We run measurements on a single GPU (NVIDIA A100 80GB SXM or H100 SXM) in bfloat16 precision for Llama 7B and 13B. For Llama 70B, we run the same measurements on two GPUs of the same type with tensor parallelism. |
| Software Dependencies | No | To verify whether increased CRs result in concrete efficiency gains, in Figure 5 we present the performance properties of DMC, estimated within the NVIDIA Megatron-LM framework (Narayanan et al., 2021). We rely on the memory-efficient implementation of MHSA from PyTorch. No version numbers for software are specified. A hedged usage sketch of the PyTorch kernel appears after the table. |
| Experiment Setup | Yes | We employ the AdamW optimizer with parameters β1 = 0.9, β2 = 0.95, and ϵ = 1e-5, in conjunction with a weight decay of 0.1 and gradient clipping of 1.0. The batch size is 1024 with a sequence length of 4096. We apply a constant learning rate identical to the final rate from the original Llama 2 pre-training phase: 3 × 10⁻⁵ for the 7B and 13B models, and 1.5 × 10⁻⁵ for the 70B model. We set the constant from Equation (8) as c = 5, which in practice results in αt = 0.0067 and ωt = (1 − 0.0067). Empirically, c = 5 is a high enough value so that we do not experience a spike in language modeling loss at the start, yet low enough to be easily changed by learning qt[0] and kt[0] through gradient optimization. Finally, we set the window size (Section 3.3) to 12, and keep the Gumbel-sigmoid temperature constant at 0.1 throughout the entire training. A sketch of these settings in PyTorch appears after the table. |
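
The pseudocode row refers to Algorithm 1 (single-head KV cache update with DMC). Below is a minimal Python sketch of how such an update could look at inference time, assuming, as the paper describes, that the append/accumulate decision and the importance weight are read off the first elements of the current key and query vectors. The function name, the running-weight variable `z`, and the exact weighted-average form are illustrative reconstructions, not the released Megatron-LM code.

```python
import torch

def dmc_update(k_cache, v_cache, z, q_t, k_t, v_t):
    """Illustrative single-head DMC KV-cache update at inference time.

    k_cache, v_cache: lists of 1-D tensors, one per retained cache slot
    z: running sum of importance weights for the last (partially merged) slot
    q_t, k_t, v_t: query/key/value vectors of the current token
    """
    # Decision bit: 0 = append a new slot, 1 = accumulate into the last slot.
    # The paper repurposes k_t[0] for this signal and rounds it at inference.
    alpha_t = torch.round(torch.sigmoid(k_t[0]))
    # Importance weight for the weighted average, read from q_t[0].
    # (The paper handles these repurposed elements specially in attention; omitted here.)
    omega_t = torch.sigmoid(q_t[0])

    if alpha_t == 1 and len(k_cache) > 0:
        # Accumulate: merge the new pair into the last slot as a running weighted average.
        k_cache[-1] = (z * k_cache[-1] + omega_t * k_t) / (z + omega_t)
        v_cache[-1] = (z * v_cache[-1] + omega_t * v_t) / (z + omega_t)
        z = z + omega_t
    else:
        # Append: open a new cache slot for the current token.
        k_cache.append(k_t)
        v_cache.append(v_t)
        z = omega_t
    return k_cache, v_cache, z
```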
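
For the software-dependencies row: the quoted passage mentions PyTorch's memory-efficient MHSA implementation but gives no version numbers. The snippet below is a hedged usage sketch of the public API (`torch.nn.functional.scaled_dot_product_attention` with the memory-efficient backend selected); the tensor shapes and the CUDA device are assumptions, and no claim is made about the exact call pattern inside Megatron-LM.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, heads, seq_len, head_dim), bfloat16 on a CUDA GPU.
q = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")

# Restrict dispatch to the memory-efficient kernel for this call.
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```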
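
For the experiment-setup row: a short sketch that instantiates the quoted optimizer settings with standard PyTorch, plus a generic Gumbel-sigmoid relaxation at the stated temperature of 0.1. The stand-in model, the `training_step` helper, and the exact Gumbel-sigmoid parametrization are assumptions for illustration; only the hyperparameter values come from the paper.

```python
import torch

# Stand-in module; the paper retrofits pretrained Llama 2 checkpoints instead.
model = torch.nn.Linear(4096, 4096)

# Quoted settings: AdamW with beta1=0.9, beta2=0.95, eps=1e-5, weight decay 0.1,
# constant LR equal to the final Llama 2 pretraining rate (3e-5 for 7B/13B,
# 1.5e-5 for 70B), and gradient clipping at 1.0.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,                 # 1.5e-5 for the 70B model
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

def training_step(loss):
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()

def gumbel_sigmoid(logits, tau=0.1, eps=1e-10):
    # Generic Gumbel-sigmoid relaxation of a binary decision; the paper keeps
    # the temperature constant at 0.1 throughout training. Exact parametrization
    # in the released code may differ.
    u1 = torch.rand_like(logits).clamp(eps, 1.0 - eps)
    u2 = torch.rand_like(logits).clamp(eps, 1.0 - eps)
    g1 = -torch.log(-torch.log(u1))
    g2 = -torch.log(-torch.log(u2))
    return torch.sigmoid((logits + g1 - g2) / tau)
```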