Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs
Authors: Jinwoo Park, Seunggeun Cho, Dongsu Han
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our evaluation validates the effectiveness of this edge-assisted approach, demonstrating that SpecEdge achieves 1.91× better cost efficiency while increasing server-side throughput by 2.22× and reducing inter-token latency by 11.24%. These improvements persist even under challenging wide-area network conditions, outperforming server-only baselines with zero network delays. We evaluate SpecEdge in an edge-assisted server configuration against a server-only configuration across various LLMs and datasets. Our findings are summarized as follows: SpecEdge enhances cost efficiency by an average of 1.91× compared to the server-only environment through increasing server throughput by 2.22× on average. It reduces the inter-token latency by an average of 11.24%, even with a 14.07 ms round-trip time between the server and edge, outperforming the server-only configuration with no network delay. Implementation and Setup. Our system's edge-assisted configuration utilizes a server-side NVIDIA A100 GPU connected to multiple edge-side NVIDIA RTX4090 GPUs over a wide-area network. The number of RTX4090 GPUs scales with the number of concurrent requests (batch size × 2). In our experiments, we measured an average round-trip time (RTT) of 14.07 ms between the local edge node and our Google Cloud instance. We conducted evaluations across various models and datasets under diverse operating conditions. |
| Researcher Affiliation | Academia | Jinwoo Park (KAIST), Seunggeun Cho (KAIST), Dongsu Han (KAIST) |
| Pseudocode | No | The paper describes methods and processes in prose and with architectural diagrams (e.g., Figures 2, 4, 5). While it refers to a "tree construction algorithm" in Appendix A, there are no structured pseudocode blocks or algorithms explicitly labeled as such within the document. |
| Open Source Code | Yes | The code is available at https://github.com/kaist-ina/specedge |
| Open Datasets | Yes | Models and datasets. We use four different LLMs: Qwen3-32B/14B [Team, 2025], Vicuna-33B [Chiang et al., 2023] and Llama2-13B-chat-hf [Touvron et al., 2023]. Unless specifically noted, all models are configured with a temperature setting of 0.7. For the draft models, we use five different models: Qwen3-1.7B/0.6B [Team, 2025], Sheared-LLaMA-1.3B [Xia et al., 2023], TinyLlama-1.1B [Zhang et al., 2024], and JackFram-160M [Miao et al., 2024]. Finally, we use Spec-Bench [Xia et al., 2024], C4 (en) [Raffel et al., 2020], OpenAssistant conversations datasets [Köpf et al., 2024], WikiText-2 [Merity et al., 2016], and MT-Bench [Zheng et al., 2023]. |
| Dataset Splits | No | The paper mentions several datasets (Spec-Bench, C4, OAsst, WikiText-2, MT-Bench) and states "For each query, we generate up to 256 output tokens." However, it does not explicitly provide details on how these datasets were split into training, validation, or test sets for the experiments. It refers to "Spec-Bench (spanning six different tasks)", but this describes benchmark tasks, not a data-splitting methodology. |
| Hardware Specification | Yes | Our system's edge-assisted configuration utilizes a server-side NVIDIA A100 GPU connected to multiple edge-side NVIDIA RTX4090 GPUs over a wide-area network. ... We measured SpecEdge performance using the RTX 3060 Ti and the RTX 2080 Ti. |
| Software Dependencies | No | The paper discusses various LLM frameworks and related work such as DeepSpeed-Inference, TensorRT-LLM, and vLLM. It also mentions different LLMs (Qwen3, Vicuna, Llama2) and draft models, as well as several datasets. However, it does not provide specific version numbers for any ancillary software, programming languages, or libraries used in the implementation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Unless specifically noted, all models are configured with a temperature setting of 0.7. For the draft models, we use five different models: Qwen3-1.7B/0.6B [Team, 2025], Sheared-LLaMA-1.3B [Xia et al., 2023], TinyLlama-1.1B [Zhang et al., 2024], and JackFram-160M [Miao et al., 2024]. Our primary baseline is a server-only configuration employing tree-based speculative decoding, supplemented by autoregressive decoding and a layer-split approach that offloads part of the LLM's layers to an edge device. ... In our main experiments, we use the draft tree size to 32 for each request. |
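
For readers who want to reuse the quoted settings, the following is a minimal sketch that gathers the values stated in the table into one place. It is not taken from the SpecEdge repository: the class name `SpecEdgeSetup`, its field names, and the helper `edge_gpus_needed` are hypothetical, and the helper assumes that "batch size × 2" means two edge GPUs per concurrent request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecEdgeSetup:
    """Hypothetical container for settings quoted in the table above."""

    temperature: float = 0.7          # "temperature setting of 0.7"
    draft_tree_size: int = 32         # draft tree size per request
    max_output_tokens: int = 256      # "up to 256 output tokens" per query
    measured_rtt_ms: float = 14.07    # average server-edge round-trip time
    server_gpu: str = "NVIDIA A100"
    edge_gpu: str = "NVIDIA RTX4090"


def edge_gpus_needed(batch_size: int) -> int:
    """Return the edge GPU count for a given number of concurrent requests.

    Assumption: the paper's "(batch size x 2)" is read as batch_size * 2.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size * 2


setup = SpecEdgeSetup()
print(setup)
print("Edge GPUs for 4 concurrent requests:", edge_gpus_needed(4))
```

The values are defaults on a frozen dataclass so that a reproduction script cannot silently overwrite them; settings the table does not state (software versions, dataset splits) are deliberately left out.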