DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Authors: Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DéjàVu under different use cases. In pipeline parallel configurations without failures, DéjàVu improves LLM serving throughput by up to 2× compared to FasterTransformer. |
| Researcher Affiliation | Collaboration | Foteini Strati (MSR Project Fiddle Intern, ETH Zurich), Sara McAllister (MSR Project Fiddle Intern, Carnegie Mellon University), Amar Phanishayee (Microsoft Research), Jakub Tarnawski (Microsoft Research), Ana Klimovic (ETH Zurich). |
| Pseudocode | No | The paper describes algorithms and processes in text and figures but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | DéjàVu is available at https://github.com/msr-fiddle/dejavu. |
| Open Datasets | Yes | We use one client, which submits requests following a Poisson distribution in an open loop, with varying request rates. We use a microbatch size of 8 in all cases. Similarly to Orca (Yu et al., 2022) and vLLM (Kwon et al., 2023) we report normalized latency (seconds/token) for each request rate. ... and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023), assuming all requests within a microbatch generate the same number of tokens. (Minimal sketches of this open-loop client and the normalized-latency metric appear after the table.) |
| Dataset Splits | No | The paper mentions using the LMSys dataset but does not explicitly provide train/validation/test splits (percentages or counts) or reference standard predefined splits for replication. |
| Hardware Specification | Yes | Setup We use VMs with 2 A100-80GB GPUs, and inter-VM network bandwidth of 40 Gbps. ... In this experiment, we also evaluate OPT-30B on VMs with 2 V100-16GB GPUs, and inter-VM network bandwidth of 32 Gbps. |
| Software Dependencies | No | The paper mentions software such as FasterTransformer, Hugging Face versions of models, NCCL, gRPC, MPI, and Boost, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We configure all our requests to a fixed prompt size (1000 tokens for Figure 8), and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023)... We use a microbatch size of 8 in all cases. ... Each request has a prompt size of 500 tokens and generates 1000 extra tokens. |
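The Open Datasets row describes the paper's load generator: an open-loop client that issues requests with Poisson-distributed arrivals at a configurable rate. A minimal Python sketch of such a client follows; the function name, seed, and example rate are illustrative assumptions, not DéjàVu's actual harness (which also samples generation lengths from the LMSys trace).

```python
import random

def poisson_arrival_times(request_rate: float, num_requests: int, seed: int = 0) -> list[float]:
    """Timestamps for an open-loop Poisson client: inter-arrival gaps are
    exponentially distributed with mean 1/request_rate, and requests are
    issued at these times regardless of whether earlier ones have finished."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        t += rng.expovariate(request_rate)  # gap ~ Exp(request_rate)
        arrivals.append(t)
    return arrivals

# Example: 100 requests at an average rate of 2 requests/second.
print(poisson_arrival_times(request_rate=2.0, num_requests=100)[:3])
```

The open-loop property matters for the experiments: because arrivals do not wait on completions, queueing delay shows up in the measured latency as the request rate grows.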
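Likewise, the normalized-latency metric quoted above (seconds/token, following Orca and vLLM) reduces to dividing a request's end-to-end latency by the number of tokens it generated. A sketch under that reading, with illustrative numbers:

```python
def normalized_latency(end_to_end_latency_s: float, generated_tokens: int) -> float:
    """Normalized latency in seconds/token: total request latency divided
    by the number of tokens the request generated."""
    return end_to_end_latency_s / generated_tokens

# Example: a request that generated 1000 tokens in 25 s end to end.
print(normalized_latency(25.0, 1000))  # 0.025 s/token
```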