KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation

Authors: Minsik Cho, Mohammad Rastegari, Devang Naik

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared with an existing parallelization scheme such as tensor or sequential parallelization where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 1.4× and 1.6× speedups for Llama 7B and Falcon 7B, respectively.
Researcher Affiliation | Industry | 1Apple, USA 2Meta, USA (the work done while being with Apple). Correspondence to: Minsik Cho <minsik@apple.com>.
Pseudocode | Yes | Fig. 7 shows the pseudocode/computational graph without and with KV-Runahead.
Open Source Code | No | The paper does not provide a specific link or explicit statement about the release of its source code.
Open Datasets | No | The paper evaluates inference performance of pre-trained LLMs on varying context lengths, but does not provide access information for a public dataset or its training split.
Dataset Splits | No | The paper evaluates inference performance of pre-trained LLMs, but does not define or provide access to dataset splits for validation.
Hardware Specification | Yes | All our experiments were done on a single node with 8 NVidia A100 GPUs, and under high (300GB/s) and low (10GB/s) bandwidth setups.
Software Dependencies | Yes | We used PyTorch 2.0 (Paszke et al., 2019) and NCCL 2.14 to enable KV-Runahead in HuggingFace LLaMA 7B and Falcon 7B (Touvron et al., 2023; Almazrouei et al., 2023).
Experiment Setup | Yes | We used FP16 for the inference.
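The Research Type row quotes the paper's baseline: tensor or sequential parallelization in which each rank generates keys and values for its local slice of the context and the full KV cache is assembled with all-gather collectives. The sketch below illustrates that baseline exchange using torch.distributed with the NCCL backend listed under Software Dependencies; the function name, projection modules, and tensor shapes are illustrative assumptions, not the paper's implementation of KV-Runahead.

```python
import torch
import torch.distributed as dist

def allgather_kv(local_hidden, k_proj, v_proj):
    """Assemble a full-context KV cache from per-rank context slices.

    Assumes dist.init_process_group("nccl") has already been called.
    local_hidden: [batch, local_seq, dim] hidden states held by this rank.
    k_proj / v_proj: key/value projection modules (illustrative).
    """
    k_local = k_proj(local_hidden)
    v_local = v_proj(local_hidden)
    world = dist.get_world_size()
    k_parts = [torch.empty_like(k_local) for _ in range(world)]
    v_parts = [torch.empty_like(v_local) for _ in range(world)]
    # Every rank receives every other rank's locally generated keys/values.
    dist.all_gather(k_parts, k_local)
    dist.all_gather(v_parts, v_local)
    # Concatenate along the sequence axis to recover the full-context cache.
    return torch.cat(k_parts, dim=1), torch.cat(v_parts, dim=1)
```

KV-Runahead itself replaces this all-gather pattern with parallel generation of the KV cache across processes; see Fig. 7 of the paper for its pseudocode and computational graph.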
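The Software Dependencies and Experiment Setup rows pin the evaluation stack to PyTorch 2.0, HuggingFace checkpoints of LLaMA 7B and Falcon 7B, and FP16 inference. Below is a minimal sketch of that inference setup using the standard transformers API; the checkpoint identifier, prompt, and generation parameters are placeholders rather than values taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/llama-7b"  # placeholder: any LLaMA 7B checkpoint in HF format
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FP16 inference, as reported in Experiment Setup
    device_map="auto",          # spread layers across the available GPUs
)

prompt = "An example long-context prompt ..."  # placeholder input
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32, use_cache=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```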