DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving

Authors: Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DéjàVu under different use cases. In pipeline-parallel configurations without failures, DéjàVu improves LLM serving throughput by up to 2× compared to FasterTransformer.
Researcher Affiliation | Collaboration | Foteini Strati (1, 2), Sara McAllister (1, 3), Amar Phanishayee (4), Jakub Tarnawski (4), Ana Klimovic (2); (1) MSR Project Fiddle Intern, (2) ETH Zurich, (3) Carnegie Mellon University, (4) Microsoft Research.
Pseudocode | No | The paper describes algorithms and processes in text and figures but does not include formal pseudocode or algorithm blocks.
Open Source Code | Yes | DéjàVu is available at https://github.com/msr-fiddle/dejavu.
Open Datasets | Yes | We use one client, which submits requests following a Poisson distribution in an open loop, with varying request rates. We use a microbatch size of 8 in all cases. Similarly to Orca (Yu et al., 2022) and vLLM (Kwon et al., 2023) we report normalized latency (seconds/token) for each request rate. ... and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023), assuming all requests within a microbatch generate the same number of tokens.
Dataset Splits | No | The paper mentions using the LMSys dataset but does not explicitly provide train/validation/test splits (percentages or counts) or reference standard predefined splits for replication.
Hardware Specification | Yes | Setup: We use VMs with 2 A100-80GB GPUs, and inter-VM network bandwidth of 40 Gbps. ... In this experiment, we also evaluate OPT-30B on VMs with 2 V100-16GB GPUs, and inter-VM network bandwidth of 32 Gbps.
Software Dependencies | No | The paper mentions software like FasterTransformer, Hugging Face versions of models, NCCL, gRPC, MPI, and Boost, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We configure all our requests to a fixed prompt size (1000 tokens for Figure 8), and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023)... We use a microbatch size of 8 in all cases. ... Each request has a prompt size of 500 tokens and generates 1000 extra tokens. [A workload-generation sketch follows this table.]
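
The workload described in the Open Datasets and Experiment Setup rows can be illustrated with a short sketch. This is not the paper's benchmarking code: send_request is a hypothetical placeholder for the serving client, lmsys_output_lengths stands in for output lengths sampled from the LMSys dataset, and the request rate is an arbitrary example value. The sketch shows an open-loop client with Poisson (exponential inter-arrival) submissions, a fixed prompt size, one sampled output length shared by each microbatch of 8 requests, and normalized latency reported as seconds per generated token.

    # Minimal sketch, assuming a hypothetical send_request() client call and
    # placeholder LMSys output-length samples.
    import random
    import time
    from concurrent.futures import ThreadPoolExecutor

    REQUEST_RATE = 2.0      # lambda of the Poisson arrival process (requests/s), example value
    MICROBATCH_SIZE = 8     # all requests in a microbatch share one output length
    PROMPT_TOKENS = 1000    # fixed prompt size (1000 tokens for Figure 8)

    lmsys_output_lengths = [128, 256, 512]  # placeholder for LMSys-sampled lengths


    def send_request(prompt_tokens: int, output_tokens: int) -> float:
        """Hypothetical client call; returns end-to-end latency in seconds."""
        start = time.time()
        # ... issue the request to the serving system and wait for completion ...
        return time.time() - start


    def run_open_loop(num_microbatches: int) -> list[float]:
        """Submit requests open-loop and return normalized latencies (s/token)."""
        futures = []
        with ThreadPoolExecutor(max_workers=64) as pool:
            for _ in range(num_microbatches):
                # One output length per microbatch, sampled from the dataset.
                output_tokens = random.choice(lmsys_output_lengths)
                for _ in range(MICROBATCH_SIZE):
                    # Open loop: exponential inter-arrival times, issued
                    # regardless of whether earlier requests have completed.
                    time.sleep(random.expovariate(REQUEST_RATE))
                    fut = pool.submit(send_request, PROMPT_TOKENS, output_tokens)
                    futures.append((fut, output_tokens))
            # Normalized latency, as reported by Orca and vLLM: seconds per token.
            return [fut.result() / toks for fut, toks in futures]

Varying REQUEST_RATE and recording the resulting normalized latencies would reproduce the latency-versus-request-rate style of evaluation the excerpts describe.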