Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Authors: Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We retrofit pretrained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7× throughput increase during auto-regressive inference on an NVIDIA H100 GPU. ... We evaluate our DMC models on a series of downstream tasks such as MMLU (Hendrycks et al., 2021) for factuality, QA datasets for common-sense reasoning, and HumanEval (Chen et al., 2021) for code. We find that DMC LLMs retain a downstream performance similar to the original LLM, whereas baselines such as GQA, H2O, and TOVA incur significant degradation at high compression ratios. |
| Researcher Affiliation | Collaboration | NVIDIA, University of Wrocław, University of Edinburgh |
| Pseudocode | Yes | Algorithm 1: Single-head KV cache update with Dynamic Memory Compression (DMC). A hedged sketch of this update appears after the table. |
| Open Source Code | Yes | We release the DMC code and models at https://github.com/NVIDIA/Megatron-LM/tree/DMC. |
| Open Datasets | Yes | We evaluate models on a series of downstream tasks, including MMLU (Hendrycks et al., 2021) for factuality, HumanEval (Chen et al., 2021) for Python code generation, and several question-answering datasets for common-sense reasoning: PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), Arc-C and Arc-E (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and WinoGrande (Sakaguchi et al., 2020). |
| Dataset Splits | Yes | We evaluate models on a series of downstream tasks, including MMLU (Hendrycks et al., 2021) for factuality, HumanEval (Chen et al., 2021) for Python code generation, and several question-answering datasets for common-sense reasoning: PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), Arc-C and Arc-E (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and WinoGrande (Sakaguchi et al., 2020). We report the 5-shot performance on MMLU, average pass@1 scores for HumanEval, and average 0-shot performance on common-sense benchmarks (CS-QA). |
| Hardware Specification | Yes | We run measurements on a single GPU (NVIDIA A100 80GB SXM or H100 SXM) in bfloat16 precision for Llama 7B and 13B. For Llama 70B, we run the same measurements on two GPUs of the same type with tensor parallelism. |
| Software Dependencies | No | To verify whether increased CRs result in concrete efficiency gains, in Figure 5 we present the performance properties of DMC, estimated within the NVIDIA Megatron-LM framework (Narayanan et al., 2021). We rely on the memory-efficient implementation of MHSA from PyTorch. No version numbers for software are specified. A hedged usage sketch of the PyTorch kernel appears after the table. |
| Experiment Setup | Yes | We employ the AdamW optimizer with parameters β1 = 0.9, β2 = 0.95, and ϵ = 1e-5, in conjunction with a weight decay of 0.1 and gradient clipping of 1.0. The batch size is 1024 with a sequence length of 4096. We apply a constant learning rate identical to the final rate from the original Llama 2 pre-training phase: 3 × 10⁻⁵ for the 7B and 13B models, and 1.5 × 10⁻⁵ for the 70B model. We set the constant from Equation (8) as c = 5, which in practice results in αt = 0.0067 and ωt = (1 − 0.0067). Empirically, c = 5 is a high enough value so that we do not experience a spike in language modeling loss at the start, yet low enough to be easily changed by learning qt[0] and kt[0] through gradient optimization. Finally, we set the window size (Section 3.3) to 12, and keep the Gumbel-sigmoid temperature constant at 0.1 throughout the entire training. A sketch of these settings in PyTorch appears after the table. |
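
The pseudocode row refers to Algorithm 1 (single-head KV cache update with DMC). Below is a minimal Python sketch of how such an update could look at inference time, assuming, as the paper describes, that the append/accumulate decision and the importance weight are read off the first elements of the current key and query vectors. The function name, the running-weight variable `z`, and the exact weighted-average form are illustrative reconstructions, not the released Megatron-LM code.

```python
import torch

def dmc_update(k_cache, v_cache, z, q_t, k_t, v_t):
    """Illustrative single-head DMC KV-cache update at inference time.

    k_cache, v_cache: lists of 1-D tensors, one per retained cache slot
    z: running sum of importance weights for the last (partially merged) slot
    q_t, k_t, v_t: query/key/value vectors of the current token
    """
    # Decision bit: 0 = append a new slot, 1 = accumulate into the last slot.
    # The paper repurposes k_t[0] for this signal and rounds it at inference.
    alpha_t = torch.round(torch.sigmoid(k_t[0]))
    # Importance weight for the weighted average, read from q_t[0].
    # (The paper handles these repurposed elements specially in attention; omitted here.)
    omega_t = torch.sigmoid(q_t[0])

    if alpha_t == 1 and len(k_cache) > 0:
        # Accumulate: merge the new pair into the last slot as a running weighted average.
        k_cache[-1] = (z * k_cache[-1] + omega_t * k_t) / (z + omega_t)
        v_cache[-1] = (z * v_cache[-1] + omega_t * v_t) / (z + omega_t)
        z = z + omega_t
    else:
        # Append: open a new cache slot for the current token.
        k_cache.append(k_t)
        v_cache.append(v_t)
        z = omega_t
    return k_cache, v_cache, z
```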
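
For the software-dependencies row: the quoted passage mentions PyTorch's memory-efficient MHSA implementation but gives no version numbers. The snippet below is a hedged usage sketch of the public API (`torch.nn.functional.scaled_dot_product_attention` with the memory-efficient backend selected); the tensor shapes and the CUDA device are assumptions, and no claim is made about the exact call pattern inside Megatron-LM.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: (batch, heads, seq_len, head_dim), bfloat16 on a CUDA GPU.
q = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")
k = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")
v = torch.randn(1, 32, 4096, 128, dtype=torch.bfloat16, device="cuda")

# Restrict dispatch to the memory-efficient kernel for this call.
with torch.backends.cuda.sdp_kernel(enable_flash=False,
                                    enable_math=False,
                                    enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```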
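
For the experiment-setup row: a short sketch that instantiates the quoted optimizer settings with standard PyTorch, plus a generic Gumbel-sigmoid relaxation at the stated temperature of 0.1. The stand-in model, the `training_step` helper, and the exact Gumbel-sigmoid parametrization are assumptions for illustration; only the hyperparameter values come from the paper.

```python
import torch

# Stand-in module; the paper retrofits pretrained Llama 2 checkpoints instead.
model = torch.nn.Linear(4096, 4096)

# Quoted settings: AdamW with beta1=0.9, beta2=0.95, eps=1e-5, weight decay 0.1,
# constant LR equal to the final Llama 2 pretraining rate (3e-5 for 7B/13B,
# 1.5e-5 for 70B), and gradient clipping at 1.0.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-5,                 # 1.5e-5 for the 70B model
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

def training_step(loss):
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()

def gumbel_sigmoid(logits, tau=0.1, eps=1e-10):
    # Generic Gumbel-sigmoid relaxation of a binary decision; the paper keeps
    # the temperature constant at 0.1 throughout training. Exact parametrization
    # in the released code may differ.
    u1 = torch.rand_like(logits).clamp(eps, 1.0 - eps)
    u2 = torch.rand_like(logits).clamp(eps, 1.0 - eps)
    g1 = -torch.log(-torch.log(u1))
    g2 = -torch.log(-torch.log(u2))
    return torch.sigmoid((logits + g1 - g2) / tau)
```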