Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving
Authors: Foteini Strati, Sara McAllister, Amar Phanishayee, Jakub Tarnawski, Ana Klimovic
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DéjàVu under different use cases. In pipeline parallel configurations without failures, DéjàVu improves LLM serving throughput by up to 2× compared to FasterTransformer. |
| Researcher Affiliation | Collaboration | Foteini Strati (1, 2), Sara McAllister (1, 3), Amar Phanishayee (4), Jakub Tarnawski (4), Ana Klimovic (2). Affiliations: (1) MSR Project Fiddle Intern; (2) ETH Zurich; (3) Carnegie Mellon University; (4) Microsoft Research. |
| Pseudocode | No | The paper describes algorithms and processes in text and figures but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | DéjàVu is available at https://github.com/msr-fiddle/dejavu. |
| Open Datasets | Yes | We use one client, which submits requests following a Poisson distribution in an open loop, with varying request rates. We use microbatch size of 8 in all cases. Similarly to Orca (Yu et al., 2022) and vLLM (Kwon et al., 2023) we report normalized latency (seconds/token) for each request rate. ... and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023), assuming all requests within a microbatch generate the same number of tokens. |
| Dataset Splits | No | The paper mentions using the LMSys dataset but does not explicitly provide train/validation/test splits (percentages or counts) or reference standard predefined splits for replication. |
| Hardware Specification | Yes | Setup: We use VMs with 2 A100-80GB GPUs, and inter-VM network bandwidth of 40 Gbps. ... In this experiment, we also evaluate OPT-30B on VMs with 2 V100-16GB GPUs, and inter-VM network bandwidth of 32 Gbps. |
| Software Dependencies | No | The paper mentions software like FasterTransformer, Hugging Face versions of models, NCCL, gRPC, MPI, and Boost, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | We configure all our requests to a fixed prompt size (1000 tokens for Figure 8), and we sample the number of newly generated tokens from the LMSys dataset (Zheng et al., 2023)... We use microbatch size of 8 in all cases. ... Each request has a prompt size of 500 tokens and generates 1000 extra tokens. |
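
The workload quoted in the Open Datasets and Experiment Setup rows (an open-loop client with Poisson arrivals, reporting normalized latency in seconds/token) can be illustrated with a minimal sketch. This is not the authors' benchmarking code: the function names, the seed, and the example request rate are assumptions made for illustration, and normalized latency is assumed to be end-to-end latency divided by the number of generated tokens, averaged over requests, as in the Orca and vLLM evaluations that the quote refers to.

```python
import random


def poisson_arrival_times(rate_rps, num_requests, seed=0):
    """Arrival timestamps (seconds) of an open-loop Poisson process.

    Open loop means arrivals do not wait for earlier requests to finish:
    each gap is drawn independently from an exponential distribution
    with mean 1 / rate_rps.
    """
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    now = 0.0
    arrivals = []
    for _ in range(num_requests):
        now += rng.expovariate(rate_rps)
        arrivals.append(now)
    return arrivals


def normalized_latency(arrival, finish, generated_tokens):
    """End-to-end latency of one request per generated token (seconds/token)."""
    return (finish - arrival) / generated_tokens


def mean_normalized_latency(records):
    """Average normalized latency over (arrival, finish, generated_tokens) records."""
    values = [normalized_latency(a, f, n) for a, f, n in records]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical request rate of 2 requests/second.
    arrivals = poisson_arrival_times(rate_rps=2.0, num_requests=5)
    print(arrivals)
```

Sweeping `rate_rps` and plotting `mean_normalized_latency` against it would give a latency-versus-request-rate curve of the kind the quoted setup describes.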