KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation
Authors: Minsik Cho, Mohammad Rastegari, Devang Naik
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared with existing parallelization schemes such as tensor or sequential parallelization, where keys and values are locally generated and exchanged via all-gather collectives, our experimental results demonstrate that KV-Runahead can offer over 1.4× and 1.6× speedups for Llama 7B and Falcon 7B, respectively. (See the sketch after this table.) |
| Researcher Affiliation | Industry | 1Apple, USA; 2Meta, USA (work done while with Apple). Correspondence to: Minsik Cho <minsik@apple.com>. |
| Pseudocode | Yes | Fig. 7 shows the pseudocode/computational graph without and with KV-Runahead. |
| Open Source Code | No | The paper does not provide a specific link or explicit statement about the release of its source code. |
| Open Datasets | No | The paper evaluates inference performance of pre-trained LLMs on varying context lengths, but does not provide access information for a public dataset or its training split. |
| Dataset Splits | No | The paper evaluates inference performance of pre-trained LLMs, but does not define or provide access to dataset splits for validation. |
| Hardware Specification | Yes | All our experiments were done on a single node with 8 NVidia A100 GPUs, and under high (300GB/s) and low (10GB/s) bandwidth setups. |
| Software Dependencies | Yes | We used PyTorch 2.0 (Paszke et al., 2019) and NCCL 2.14 to enable KV-Runahead in HuggingFace LLaMA 7B and Falcon 7B (Touvron et al., 2023; Almazrouei et al., 2023). |
| Experiment Setup | Yes | We used FP16 for the inference. |
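
The Research Type row above quotes the paper's headline comparison: KV-Runahead builds the prompt's KV cache in chunks handed from process to process, instead of having every rank compute partial keys/values and exchange them with all-gather collectives. The following is a minimal, single-process sketch of the chunked causal prefill that underlies this idea; it is not the paper's implementation. In KV-Runahead each chunk would be computed on a different GPU and the accumulated KV cache passed to the next process, and all names, sizes, and the toy projections here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D = 64       # head dimension (illustrative)
T = 12       # prompt length (illustrative)
CHUNKS = 3   # number of context chunks (one per "process" in the paper's scheme)

# Toy projections standing in for one attention head of a pretrained model.
Wq, Wk, Wv = (torch.randn(D, D) / D**0.5 for _ in range(3))
x = torch.randn(T, D)  # prompt embeddings

def attend(q, k_cache, v_cache, offset):
    """Causal attention of queries at global positions [offset, offset+len(q)) over the cache."""
    scores = q @ k_cache.t() / D**0.5
    # Mask future positions: query at global position offset+i may only see keys 0..offset+i.
    key_pos = torch.arange(k_cache.size(0))
    qry_pos = offset + torch.arange(q.size(0)).unsqueeze(1)
    scores = scores.masked_fill(key_pos > qry_pos, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_cache

# --- Chunked prefill: the KV cache grows chunk by chunk. ---
# In KV-Runahead, each chunk would live on a different GPU and the accumulated
# (k_cache, v_cache) would be handed to the next process rather than all-gathered.
k_cache = torch.empty(0, D)
v_cache = torch.empty(0, D)
outputs = []
for chunk in x.chunk(CHUNKS):
    offset = k_cache.size(0)
    q, k, v = chunk @ Wq, chunk @ Wk, chunk @ Wv
    k_cache = torch.cat([k_cache, k])
    v_cache = torch.cat([v_cache, v])
    outputs.append(attend(q, k_cache, v_cache, offset))
chunked_out = torch.cat(outputs)

# --- Reference: one-shot causal prefill over the whole prompt. ---
full_out = attend(x @ Wq, x @ Wk, x @ Wv, offset=0)

print(torch.allclose(chunked_out, full_out, atol=1e-5))  # True: chunked prefill matches
```

The check at the end illustrates why the chunked scheme is exact: because attention is causal, each chunk only needs the KV cache of the chunks before it, so per-chunk computation plus a growing cache reproduces full-prompt prefill.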