KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
Authors: Minsik Cho, Mohammad Rastegari, Devang Naik
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 1.4× and 1.6× speedups for Llama 7B and Falcon 7B, respectively. (See the sketch after this table.) |
| Researcher Affiliation | Industry | 1Apple, USA; 2Meta, USA (work done while with Apple). Correspondence to: Minsik Cho <minsik@apple.com>. |
| Pseudocode | Yes | Fig. 7 shows the pseudocode/computational graph without and with KV-Runahead. |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about the release of its source code. |
| Open Datasets | No | The paper evaluates inference performance of pre-trained LLMs on varying context lengths, but does not provide access information for a public dataset or its training split. |
| Dataset Splits | No | The paper evaluates inference performance of pre-trained LLMs, but does not define or provide access to dataset splits for validation. |
| Hardware Specification | Yes | All our experiments were done on a single node with 8 NVidia A100 GPUs, and under high (300GB/s) and low (10GB/s) bandwidth setups. |
| Software Dependencies | Yes | We used PyTorch 2.0 (Paszke et al., 2019) and NCCL 2.14 to enable KV-Runahead in HuggingFace LLaMA 7B and Falcon 7B (Touvron et al., 2023; Almazrouei et al., 2023). |
| Experiment Setup | Yes | We used FP16 for the inference. |
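
The Research Type row above quotes the paper's headline comparison: KV-Runahead builds the prompt's KV cache in chunks handed from process to process, instead of having every rank compute partial keys/values and exchange them with all-gather collectives. The following is a minimal, single-process sketch of the chunked causal prefill that underlies this idea; it is not the paper's implementation. In KV-Runahead each chunk would be computed on a different GPU and the accumulated KV cache passed to the next process, and all names, sizes, and the toy projections here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D = 64       # head dimension (illustrative)
T = 12       # prompt length (illustrative)
CHUNKS = 3   # number of context chunks (one per "process" in the paper's scheme)

# Toy projections standing in for one attention head of a pretrained model.
Wq, Wk, Wv = (torch.randn(D, D) / D**0.5 for _ in range(3))
x = torch.randn(T, D)  # prompt embeddings

def attend(q, k_cache, v_cache, offset):
    """Causal attention of queries at global positions [offset, offset+len(q)) over the cache."""
    scores = q @ k_cache.t() / D**0.5
    # Mask future positions: query at global position offset+i may only see keys 0..offset+i.
    key_pos = torch.arange(k_cache.size(0))
    qry_pos = offset + torch.arange(q.size(0)).unsqueeze(1)
    scores = scores.masked_fill(key_pos > qry_pos, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_cache

# --- Chunked prefill: the KV cache grows chunk by chunk. ---
# In KV-Runahead, each chunk would live on a different GPU and the accumulated
# (k_cache, v_cache) would be handed to the next process rather than all-gathered.
k_cache = torch.empty(0, D)
v_cache = torch.empty(0, D)
outputs = []
for chunk in x.chunk(CHUNKS):
    offset = k_cache.size(0)
    q, k, v = chunk @ Wq, chunk @ Wk, chunk @ Wv
    k_cache = torch.cat([k_cache, k])
    v_cache = torch.cat([v_cache, v])
    outputs.append(attend(q, k_cache, v_cache, offset))
chunked_out = torch.cat(outputs)

# --- Reference: one-shot causal prefill over the whole prompt. ---
full_out = attend(x @ Wq, x @ Wk, x @ Wv, offset=0)

print(torch.allclose(chunked_out, full_out, atol=1e-5))  # True: chunked prefill matches
```

The check at the end illustrates why the chunked scheme is exact: because attention is causal, each chunk only needs the KV cache of the chunks before it, so per-chunk computation plus a growing cache reproduces full-prompt prefill.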