DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Authors: Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DéjàVu under different use cases. In pipeline parallel configurations without failures, DéjàVu improves LLM serving throughput by up to 2× compared to FasterTransformer. |
| Researcher Affiliation | Collaboration | Foteini Strati (MSR Project Fiddle Intern, ETH Zurich), Sara McAllister (MSR Project Fiddle Intern, Carnegie Mellon University), Amar Phanishayee (Microsoft Research), Jakub Tarnawski (Microsoft Research), Ana Klimovic (ETH Zurich). |
| Pseudocode | No | The paper describes algorithms and processes in text and figures but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | DéjàVu is available at https://github.com/msr-fiddle/dejavu. |
| Open Datasets | Yes | We use one client, which submits requests following a Poisson distribution in an open loop, with varying request rates. We use a microbatch size of 8 in all cases. Similarly to Orca (Yu et al., 2022) and vLLM (Kwon et al., 2023) we report normalized latency (seconds/token) for each request rate. ... and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023), assuming all requests within a microbatch generate the same number of tokens. (Minimal sketches of this open-loop client and the normalized-latency metric appear after the table.) |
| Dataset Splits | No | The paper mentions using the LMSys dataset but does not explicitly provide train/validation/test splits (percentages or counts) or reference standard predefined splits for replication. |
| Hardware Specification | Yes | Setup We use VMs with 2 A100-80GB GPUs, and inter-VM network bandwidth of 40 Gbps. ... In this experiment, we also evaluate OPT-30B on VMs with 2 V100-16GB GPUs, and inter-VM network bandwidth of 32 Gbps. |
| Software Dependencies | No | The paper mentions software such as FasterTransformer, Hugging Face versions of models, NCCL, gRPC, MPI, and Boost, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We configure all our requests to a fixed prompt size (1000 tokens for Figure 8), and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023)... We use a microbatch size of 8 in all cases. ... Each request has a prompt size of 500 tokens and generates 1000 extra tokens. |
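The Open Datasets row describes the paper's load generator: an open-loop client that issues requests with Poisson-distributed arrivals at a configurable rate. A minimal Python sketch of such a client follows; the function name, seed, and example rate are illustrative assumptions, not DéjàVu's actual harness (which also samples generation lengths from the LMSys trace).

```python
import random

def poisson_arrival_times(request_rate: float, num_requests: int, seed: int = 0) -> list[float]:
    """Timestamps for an open-loop Poisson client: inter-arrival gaps are
    exponentially distributed with mean 1/request_rate, and requests are
    issued at these times regardless of whether earlier ones have finished."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        t += rng.expovariate(request_rate)  # gap ~ Exp(request_rate)
        arrivals.append(t)
    return arrivals

# Example: 100 requests at an average rate of 2 requests/second.
print(poisson_arrival_times(request_rate=2.0, num_requests=100)[:3])
```

The open-loop property matters for the experiments: because arrivals do not wait on completions, queueing delay shows up in the measured latency as the request rate grows.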
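Likewise, the normalized-latency metric quoted above (seconds/token, following Orca and vLLM) reduces to dividing a request's end-to-end latency by the number of tokens it generated. A sketch under that reading, with illustrative numbers:

```python
def normalized_latency(end_to_end_latency_s: float, generated_tokens: int) -> float:
    """Normalized latency in seconds/token: total request latency divided
    by the number of tokens the request generated."""
    return end_to_end_latency_s / generated_tokens

# Example: a request that generated 1000 tokens in 25 s end to end.
print(normalized_latency(25.0, 1000))  # 0.025 s/token
```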