Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
HiFC: High-efficiency Flash-based KV Cache Swapping for Scaling LLM Inference
Authors: Inho Jeong, Sunghyeon Woo, Sol Namkung, Dongsuk Jeon
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrated that HiFC maintains inference throughput comparable to traditional DRAM swapping methods. In addition, our extended validation confirms this advantage holds across diverse models, datasets, and even in multi-GPU scaling scenarios, highlighting HiFC's robustness and general applicability. |
| Researcher Affiliation | Collaboration | Inho Jeong (1, 2), Sunghyeon Woo (1, 3), Sol Namkung (1), Dongsuk Jeon (1). Affiliations: (1) Seoul National University, (2) SK hynix, (3) NAVER Cloud |
| Pseudocode | No | The paper describes the architecture and operational flow of HiFC with diagrams and descriptive text, but it does not contain a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the complete code and benchmark scripts necessary to accurately reproduce our experimental results. The raw experimental data and logs presented in this paper are also included. |
| Open Datasets | Yes | To ensure consistent and accurate performance benchmarking, we first selected the DeepSeek R1-Distill-Qwen-32B (DS-Qwen-32B) model [5, 31] and conducted measurements under fixed configurations. To validate the robustness and general applicability of HiFC, we then extended our evaluation to include diverse models and datasets, with these results presented in Section 5.3. ... [5] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, abs/2501.12948, 2025. ... [31] DeepSeek AI. Deepseek-r1-distill-qwen-32b. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, 2024. Accessed: 2025-05-15. ... LongBench [16] ... [16] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding, 2023. |
| Dataset Splits | No | The paper specifies workload characteristics for inference such as 'The workload consisted of a batch size of 200 from the GovReport dataset (8k input length on average) with a fixed 1k output length.' and lists average lengths for datasets like 'Qasper (avg. 3.6k length) (avg. 8.7k length) (avg. 18.4k length)'. However, it does not explicitly provide traditional training/test/validation dataset splits, as the experiments focus on inference performance rather than model training or evaluation on specific data partitions. |
| Hardware Specification | Yes | Table 16: Hardware and software configuration used in our experiments. System & Hardware: Server / CPU: Dell PowerEdge R750xa, 2× Intel Xeon Silver 4310; GPU: 2× A100 (80 GiB); Memory: 256 GiB DDR4; OS: Ubuntu 22.04.5 LTS; System Storage: Samsung PM893, 7.68 TB SATA SSD; Flash cache: SSD, 1 TB NVMe Gen4 (pSLC cache: 200 GiB) |
| Software Dependencies | Yes | Table 16: Hardware and software configuration used in our experiments. Software & LLM Model: Baseline: vLLM v0.6.6; CUDA Toolkit: 12.3; PyTorch: 2.5.1; GDS Release: 1.8.1.2 |
| Experiment Setup | Yes | We trigger swap operations by progressively increasing both input and output sequence lengths and compare their impact on system performance. Fig. 3 presents a comparative analysis of KV cache swap performance between HiFC and DRAM under various input and output configurations, where the throughput achieved by both memory types remains comparable. In Fig. 3a, as the input length increases... In contrast, Fig. 3b demonstrates that increasing output length... (Section 5.1). ...The workload consisted of a batch size of 200 from the GovReport dataset (8k input length on average) with a fixed 1k output length (Section 5.2). ...We fixed the batch size to 20 and increased the number of concurrently pipelined sequences... (Section 5.4). ...context length = 5k, batch size = 20, block size = 32, and 100 GiB of dedicated swap space for KV cache for both methods (Figure 4 caption). ...Table 6: Comparisons of LLM session initialization time. KV Cache Size (GiB) 10 20 30 40 50 60 70 80 90 100. |