Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits
Authors: Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. |
| Researcher Affiliation | Collaboration | Yeshwanth Venkatesha EMAIL Department of Electrical Engineering Yale University New Haven, CT, USA Souvik Kundu EMAIL Intel Labs San Diego, CA, USA Priyadarshini Panda EMAIL Department of Electrical Engineering Yale University New Haven, CT, USA |
| Pseudocode | Yes | For detailed system design and pseudocode please refer to Appendix A. Appendix A System Design Algorithm 1 Client-Side Algorithm Algorithm 2 Server-Side Algorithm |
| Open Source Code | No | The paper does not explicitly provide a link to a code repository or an unambiguous statement of code release by the authors for their methodology. It mentions an "open-source implementation of Spatial VLM Chen et al. (2024)" but this refers to a third-party tool. |
| Open Datasets | Yes | For language generation models, we train early exit adapters on the publicly available ShareGPT conversation dataset (hf:RyokoAI/ShareGPT52K) using a single NVIDIA A100 GPU with 80GB of VRAM. Additionally, we train early exit adapters for a vision-language model based on Qwen2VL-7B using the SpaceLLaVA dataset (hf:remyxai/vqasynth_spacellava), which is generated using an open-source implementation of Spatial VLM Chen et al. (2024). We show the experiments on 6 standard generative task benchmarks spanning conversation Zheng et al. (2023), code generation Chen et al. (2021), mathematical reasoning Cobbe et al. (2021), instruction following Taori et al. (2023), summarization Nallapati et al. (2016), and question-answering tasks Kwiatkowski et al. (2019). |
| Dataset Splits | No | The paper states: "We fine-tune three models Vicuna-7B, Vicuna-13B, and Llama2-7B for 10 epochs each, using a batch size of 1 and a learning rate of 1e-4." and mentions using "6 standard generative task benchmarks". However, it does not explicitly provide specific percentages, sample counts, or clear predefined splits for the training, validation, or testing phases of these datasets. |
| Hardware Specification | Yes | Server Side Hardware: We utilize a high performance computing cluster node equipped with a single A100 GPU with 80GB VRAM, 16 CPU cores, and 8GB of CPU memory per core as our server. Client Side Hardware: We demonstrate our system on two types of client devices: 1. NVIDIA Jetson Nano: A compact AI development board tailored for edge computing. It includes a quad-core ARM Cortex-A57 CPU, a 128-core Maxwell GPU, and 4GB of LPDDR4 RAM shared between the CPU and GPU. ... 2. Cluster Node with RTX 2080 Ti: This setup features a single RTX 2080 Ti GPU with 12GB VRAM, an 8-core CPU, and 4GB of RAM per core... Robotics Case Study: Vision-Language Navigation on Unitree Go2 ... This platform features an onboard NVIDIA Jetson Orin board, which includes an 8-core ARM Cortex-A78AE v8.2 64-bit CPU and 16GB of 128-bit LPDDR5 unified memory... |
| Software Dependencies | No | The paper states: "Our proof-of-concept is implemented in Python, which offers ease of experimentation but leaves room for performance optimization." It mentions Python as the programming language but does not provide specific version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or other key software components used in their methodology. |
| Experiment Setup | Yes | We fine-tune three models Vicuna-7B, Vicuna-13B, and Llama2-7B for 10 epochs each, using a batch size of 1 and a learning rate of 1e-4. Unless otherwise specified, we use γ = 4 and n = 200 in our experiments. |
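The paper's client–server method (Appendix A, Algorithms 1 and 2) follows the standard speculative-decoding pattern: the edge client drafts γ tokens with a small model, and the cloud server verifies them with the target model, accepting the longest agreeing prefix. A minimal single-process sketch of that draft-then-verify loop, using γ = 4 as in the paper's default; the toy model functions below are hypothetical stand-ins, not the paper's Vicuna-68M / Llama2-7B setup:

```python
# Sketch of one speculative-decoding round (draft on the client,
# verify on the server). Toy deterministic "models" over tokens 0-9.

GAMMA = 4  # draft tokens proposed per round (paper default)

def draft_next(token):
    """Toy draft model: proposes the next token deterministically."""
    return (token + 1) % 10

def target_next(token):
    """Toy target model: disagrees with the draft when token == 5."""
    return (token + 2) % 10 if token == 5 else (token + 1) % 10

def speculative_step(prefix):
    """Draft GAMMA tokens, then verify them with the target model.

    Accepts the longest draft prefix the target agrees with and
    appends one token from the target (a correction on mismatch, a
    bonus token if all drafts pass), so every round makes progress.
    """
    # Client side: autoregressively draft GAMMA candidate tokens.
    drafts, last = [], prefix[-1]
    for _ in range(GAMMA):
        last = draft_next(last)
        drafts.append(last)

    # Server side: verify drafts (sequential here for clarity; the
    # target model scores all draft positions in one forward pass).
    accepted, last = [], prefix[-1]
    for d in drafts:
        t = target_next(last)
        if t != d:
            accepted.append(t)  # correction token from the target
            return prefix + accepted
        accepted.append(d)
        last = d
    accepted.append(target_next(last))  # all accepted: bonus token
    return prefix + accepted

seq = [0]
while len(seq) < 12:
    seq = speculative_step(seq)
```

Each round thus costs one draft pass per token plus a single target verification, which is what makes offloading verification to the cloud worthwhile when the edge device can only run the small model quickly.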