Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Sim-LLM: Optimizing LLM Inference at the Edge through Inter-Task KV Reuse

Authors: Ruikun Luo, Changwei Gu, Qiang He, Feifei Chen, Song Wu, Hai Jin, Yun Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| Research Type | Experimental | Evaluated on both A40 and A100 GPUs, Sim-LLM achieves a system throughput improvement of up to 39.40% and a memory reduction of up to 34.65%, compared to state-of-the-art approaches. Our source code is available at https://github.com/CGCL-codes/Sim LLM. 1 Introduction ... Section 4 Experiments |
| Researcher Affiliation | Academia | Ruikun Luo¹²³⁴, Changwei Gu¹²³⁴, Qiang He¹²³⁴, Feifei Chen⁵, Song Wu¹²³⁴, Hai Jin¹²³⁴, Yun Yang⁶; ¹National Engineering Research Center for Big Data Technology and System; ²Services Computing Technology and System Lab; ³Cluster and Grid Computing Lab; ⁴School of Computer Science and Technology, Huazhong University of Science and Technology; ⁵Deakin University; ⁶Swinburne University of Technology; EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | C Algorithm Details: We provide the algorithm details of Sim-LLM in Algorithm 1. For each batch of tasks, after preprocessing, tasks are mapped to an LSH bucket, where similarity matching is performed. If a match is found, the KV of the similar task is reused to accelerate inference; otherwise, normal inference is performed with sandwich configuration. The hash value, embedding value and top-layer KV for each processed task are stored in the KV_Manager for future task reuse (LRU eviction is adopted when reaching cache size). After each batch is processed, each server updates its task prototype and sends it to other servers to maintain the global feature table. Algorithm 1: Sim-LLM |
| Open Source Code | Yes | Our source code is available at https://github.com/CGCL-codes/Sim LLM. |
| Open Datasets | Yes | Among them, REDDIT [12] contains the comments of 50 high-quality subreddits from the REDDIT Pushshift data dumps (from 2006 to 2023), and we use the data collected after 2020 to ensure the up-to-dateness. MMChat [13] and LCCC [14] are conversation dialogues collected from Weibo, PTT, and Douban. ... All experiments related to the PPL evaluation are conducted on a 10M subset of the development set from SlimPajama [44] and the Wikipedia dataset [45]. |
| Dataset Splits | Yes | The evaluation is conducted with the official scripts from OpenCompass, employing a zero-shot approach without additional training. Two evaluation modes are utilized: perplexity (PPL) and generation (GEN). ... All experiments related to the PPL evaluation are conducted on a 10M subset of the development set from SlimPajama [44] and the Wikipedia dataset [45]. ... We conducted zero-shot accuracy evaluation on benchmarks discussed in Section 4.1, using the TinyLlama-1.1B, Llama2-7B, and Llama2-13B with official scripts from the lm-eval-harness framework [32]. |
| Hardware Specification | Yes | The performance of Sim-LLM under single-node scenarios is evaluated on a server with a single Nvidia A100 80GB GPU. Four physical machines, each equipped with four Nvidia A40 40GB GPUs, are used as edge servers to evaluate the performance across edge nodes. |
| Software Dependencies | No | The experiment is conducted in the OpenCompass evaluation framework [31] and the lm-eval-harness framework [32]. |
| Experiment Setup | Yes | Both the prompt length and generation length are set to 2,048. ... The prompt length and generation length are both 2048 in (a) & (b). ... The cache size is set to 512. ... with a fixed prompt length of 512 and a generation length of 4,096. ... In Figure 12a, the prompt length and the generation length are set to 512 and 4,096, respectively. ... Sim-LLM employs a stringent threshold to offset performance degradation after KV reusing while maintaining inference efficiency, considering two tasks to be similar when their cosine similarity exceeds 0.8. ... KV_Manager adopts the Least-Recently-Used (LRU) eviction policy that preferentially preserves frequently reused task KVs while removing those accessed least recently. |
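
The matching and eviction behaviour quoted in the Pseudocode and Experiment Setup entries (LSH bucketing, a cosine-similarity threshold of 0.8, an LRU-managed cache) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The sign-random-projection hash, the linear scan over cached entries, and every name here (`lsh_bucket`, `KVManager`, `lookup`, `store`) are assumptions made for illustration; only the 0.8 threshold and the cache size of 512 come from the quoted text.

```python
import math
from collections import OrderedDict

SIM_THRESHOLD = 0.8  # quoted: tasks are similar when cosine similarity exceeds 0.8
CACHE_SIZE = 512     # quoted: "The cache size is set to 512"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lsh_bucket(embedding, planes):
    """Assumed LSH family: one sign bit per hyperplane (random projection)."""
    return tuple(
        int(sum(p * x for p, x in zip(plane, embedding)) >= 0) for plane in planes
    )


class KVManager:
    """Hypothetical cache mapping task_id -> (bucket, embedding, kv) with LRU eviction."""

    def __init__(self, capacity=CACHE_SIZE):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first, most recent last

    def lookup(self, bucket, embedding):
        """Return the KV of the most similar cached task in the bucket, or None."""
        best_id, best_sim = None, SIM_THRESHOLD
        for task_id, (cached_bucket, cached_emb, _) in self.entries.items():
            if cached_bucket != bucket:
                continue  # only compare against tasks in the same LSH bucket
            sim = cosine(embedding, cached_emb)
            if sim > best_sim:  # strict: similarity must exceed the threshold
                best_id, best_sim = task_id, sim
        if best_id is None:
            return None  # caller falls back to normal inference
        self.entries.move_to_end(best_id)  # a reuse counts as a recent access
        return self.entries[best_id][2]

    def store(self, task_id, bucket, embedding, kv):
        """Cache a processed task, evicting least-recently-used entries if full."""
        self.entries[task_id] = (bucket, embedding, kv)
        self.entries.move_to_end(task_id)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used entry
```

In this sketch a caller would hash a new task's embedding with `lsh_bucket`, call `lookup`, and on a miss run normal inference and then `store` the result. The cross-server prototype exchange mentioned in the Pseudocode entry is out of scope here.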