Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Authors: Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). We conduct extensive experiments on a wide range of models and downstream tasks, where we benchmark AdaSPEC against DistillSpec and find that AdaSPEC successfully pushes the limit of SD across all tasks and model setups, AdaSPEC consistently achieves higher acceptance rates (up to 15%; see Table 1). |
| Researcher Affiliation | Academia | Yuezhou Hu¹, Jiaxin Guo², Xinyu Feng³, Tuo Zhao³. ¹University of California, Berkeley; ²Tsinghua University; ³Georgia Institute of Technology |
| Pseudocode | Yes | A.1 Full Algorithms for AdaSPEC. Algorithm 1: Greedy Speculative Decoding. 1: Input: target model M_p, draft model M_q, input sequence x. 2: accept ← 0, reject ← 0, t ← len(x) ... Algorithm 2: AdaSPEC: Selective Distillation for Speculative Decoding. 1: Input: dataset D, target model M_p, draft model M_q, fraction k. 2: Step 1. Fine-tune M_p: ... |
| Open Source Code | Yes | The code is publicly available at https://github.com/yuezhouhu/adaspec. |
| Open Datasets | Yes | We test these two configurations on a diverse set of five tasks, each representative of a specific domain to provide a robust evaluation framework for AdaSPEC: GSM8K [9] (A benchmark for multi-step arithmetic reasoning), Alpaca [27] (A comprehensive instruction following dataset), MBPP [3] (A Python programming challenge set for code generation), CNN/Daily Mail [22] (A long-form summarization task), and XSUM [23] (An extreme summarization challenge). |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or sample counts for training, validation, and test splits for any of the datasets used. It mentions using a 'validation set' for choosing optimal epochs, but does not quantify its size or methodology for splitting. |
| Hardware Specification | Yes | To investigate AdaSPEC's potential to accelerate end-to-end decoding in a real-world setting, we use frontier inference engine vLLM [15] on one single A100 GPU and report speed-up in Table 5. ... Table 10: GPU hours of training models on A100 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch in Listing 2 for the code implementation but does not specify a version number. No other software dependencies are listed with specific version numbers. |
| Experiment Setup | Yes | A.2 Implementation Details We use the hyperparameters in Table 9. For 3-Epoch setting, both reference and draft model are distilled for 3 epochs. For Optimal-Epoch setting, the target model is first fine-tuned to maximize performance on validation set. Specifically, for GSM8K, the number of target epochs is chosen according to validation accuracy, while for the rest of the experiments it is chosen according to validation perplexity. Afterwards, we distill the reference model and pick the one with highest α on validation set. Eventually, this model serves as the reference model to train our draft model. For robustness, we only select the optimal epoch from 1, 3, 6, 10, 15, 20 and 30 (for XSUM and CNN/Daily Mail we select from 1, 3, 6, 10 for training efficiency). ... Table 9: Experimental hyperparameters. Task Hyperparameter 3-Epoch Optimal-Epoch ... Batch size 16 ... Learning rate 3e-4 ... Epochs for target model 3 ... Filter fraction k 0.4 |
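
The Pseudocode row quotes only the opening lines of Algorithm 1 (its inputs and the accept/reject counters). As a rough illustration of how an acceptance rate can be tallied under greedy decoding, here is a minimal Python sketch. It is a generic textbook-style loop, not the paper's algorithm: the models are hypothetical callables that map a token prefix to the next greedy token, and `gamma` (draft tokens per round) and `max_new_tokens` are assumed parameters that do not appear in the quoted text.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: given a token prefix, return the greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_speculative_decode(
    target: NextToken,
    draft: NextToken,
    x: List[int],
    gamma: int = 4,            # assumed: number of draft tokens proposed per round
    max_new_tokens: int = 16,  # assumed: generation budget
) -> Tuple[List[int], float]:
    """Generic greedy speculative decoding; returns (tokens, acceptance rate)."""
    tokens = list(x)
    accept, reject = 0, 0
    limit = len(x) + max_new_tokens
    while len(tokens) < limit:
        # The draft model proposes up to gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(gamma, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # The target model checks each proposed token in order. A real system
        # would verify the whole proposal in one batched forward pass.
        for tok in proposal:
            expected = target(tokens)
            if tok == expected:
                accept += 1
                tokens.append(tok)
            else:
                reject += 1
                tokens.append(expected)  # keep the target's token and start a new round
                break
    total = accept + reject
    return tokens, (accept / total if total else 0.0)
```

Because every emitted token is the target model's greedy choice, the output matches plain greedy decoding with the target; only the number of accepted draft tokens changes with draft quality. The sketch omits the extra "bonus" token that many implementations take from the target after a fully accepted round.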
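
The Optimal-Epoch procedure quoted in the Experiment Setup row can be sketched as a small selection routine. The function names, the score dictionaries, and the task-name strings used as keys are hypothetical; only the candidate epoch lists, the choice of validation metric (accuracy for GSM8K, perplexity for the other tasks), and the use of the highest validation α for the reference model come from the quoted text.

```python
from typing import Dict, List

# Candidate epoch counts as quoted from the implementation details.
FULL_CANDIDATES: List[int] = [1, 3, 6, 10, 15, 20, 30]
SHORT_CANDIDATES: List[int] = [1, 3, 6, 10]  # XSUM and CNN/Daily Mail


def candidate_epochs(task: str) -> List[int]:
    """Return the epoch counts considered for a task (task strings are assumed keys)."""
    if task in {"XSUM", "CNN/Daily Mail"}:
        return SHORT_CANDIDATES
    return FULL_CANDIDATES


def _scored(task: str, scores: Dict[int, float]) -> List[int]:
    """Keep only the candidate epoch counts that have a validation score."""
    candidates = [e for e in candidate_epochs(task) if e in scores]
    if not candidates:
        raise ValueError("no validation score for any candidate epoch count")
    return candidates


def select_target_epochs(task: str, val_scores: Dict[int, float]) -> int:
    """Pick the target-model epoch count.

    val_scores maps epoch count -> validation accuracy for GSM8K,
    or epoch count -> validation perplexity for every other task.
    """
    candidates = _scored(task, val_scores)
    if task == "GSM8K":
        return max(candidates, key=lambda e: val_scores[e])  # higher accuracy is better
    return min(candidates, key=lambda e: val_scores[e])  # lower perplexity is better


def select_reference_epochs(task: str, val_alpha: Dict[int, float]) -> int:
    """Pick the reference-model epoch count with the highest validation alpha."""
    candidates = _scored(task, val_alpha)
    return max(candidates, key=lambda e: val_alpha[e])
```

For example, with made-up validation perplexities `{1: 9.0, 3: 7.5, 6: 8.1}` for MBPP, `select_target_epochs("MBPP", ...)` returns 3; these numbers are illustrative only and are not taken from the paper.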