Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
OmniDraft: A cross-vocabulary, online adaptive drafter for on-device speculative decoding
Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Jay Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Through extensive experiments, we show that a single Llama-68M draft model can be paired with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for cross-vocabulary speculative decoding and provides up to 1.5-2x speedup on reasoning, coding and text generation tasks. Our empirical results also show good performance across all metrics. |
| Researcher Affiliation | Industry | Ramchalam Kinattinkara Ramakrishnan1, Zhaocong Yuan2, Shaojie Zhuo3, Chen Feng4, Yicheng Lin5, Chenzheng Su6, Xiaopeng Zhang7; Qualcomm AI Research; {1rkinatti, 2zhaocong, 3shaojiez, 4chenf, 5yichengl, 6chenzhen, 7xiaopeng}@qti.qualcomm.com |
| Pseudocode | Yes | We summarize the modified speculative decoding with our proposed mappings in Algorithm 1. Algorithm 1 Cross-vocabulary Speculative Decoding ... We show the detailed algorithm for cross-vocabulary distillation in Algorithm 2. ... We show the detailed algorithms for the two variants of online adaptive drafting training in Algorithm 3 and 4. |
| Open Source Code | No | The paper uses open-source datasets, which are all available on huggingface. Links to all datasets are provided. We definitely want to provide access to code. However, it takes time for corporate legal team to review and approve. If reviewers feel necessary, we will try our best to accelerate the process of releasing code |
| Open Datasets | Yes | Tasks: We perform online distillation across 4 tasks: GSM8K [12], Alpaca [44], XSum [36], and a combined MBPP+HumanEval [4][10] dataset. Each task has a dedicated train and test set, or we slice out a portion of the train set as the test set. For MBPP+HumanEval, we combine the two datasets to add some more diversity to the data for the coding tasks. ... The paper uses open-source datasets, which are all available on Hugging Face. Links to all datasets are provided. |
| Dataset Splits | Yes | Table 9: Dataset details. Train: GSM8K 8K; MBPP+HumanEval 1K; Alpaca 8K; XSum 4K/8K. Test: GSM8K 200; MBPP+HumanEval 228; Alpaca 100; XSum 100. |
| Hardware Specification | Yes | Throughout our work, we use the environment setup with NVIDIA A100 GPU (40/80GB), PyTorch 2.1.0 framework, CUDA version 12.1, and Ubuntu 22.04 LTS. |
| Software Dependencies | Yes | Throughout our work, we use the environment setup with NVIDIA A100 GPU (40/80GB), PyTorch 2.1.0 framework, CUDA version 12.1, and Ubuntu 22.04 LTS. |
| Experiment Setup | Yes | Table 10: Training hyperparameters. Batch size: 8; learning rate (LR): 1e-4/2e-5; LR scheduler: constant; optimizer: AdamW; β1: 0.9; β2: 0.999; weight decay: 0.01/0; epochs: 1 (online); mixed precision: FP16; LoRA rank: 32; temperature: 0.01. |
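
For anyone attempting a reproduction, the hyperparameters quoted in the Experiment Setup row can be written down as a small configuration object. The following is only a sketch: the class name, field names, and the helper `enumerate_candidate_settings` are hypothetical and do not come from the paper. The row reports two learning rates (1e-4/2e-5) and two weight-decay values (0.01/0) without stating here which experiment uses which, so the sketch keeps both values and simply enumerates every pairing rather than guessing.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ReportedTrainingSetup:
    """Values copied from the Table 10 excerpt; field names are hypothetical."""

    batch_size: int = 8
    # Two values are reported; this summary does not say which run uses which.
    learning_rates: Tuple[float, ...] = (1e-4, 2e-5)
    lr_scheduler: str = "constant"
    optimizer: str = "AdamW"
    adam_beta1: float = 0.9
    adam_beta2: float = 0.999
    # Two values are reported here as well ("0.01/0").
    weight_decays: Tuple[float, ...] = (0.01, 0.0)
    epochs: int = 1  # reported as "1 (online)"
    mixed_precision: str = "FP16"
    lora_rank: int = 32
    temperature: float = 0.01


def enumerate_candidate_settings(setup: ReportedTrainingSetup) -> List[Dict]:
    """Return one flat dict per (learning rate, weight decay) pairing.

    Assumption: every pairing is treated as a candidate, because the
    excerpt does not specify how the alternatives are combined.
    """
    base = asdict(setup)
    learning_rates = base.pop("learning_rates")
    weight_decays = base.pop("weight_decays")
    return [
        dict(base, learning_rate=lr, weight_decay=wd)
        for lr in learning_rates
        for wd in weight_decays
    ]


if __name__ == "__main__":
    for candidate in enumerate_candidate_settings(ReportedTrainingSetup()):
        print(candidate["learning_rate"], candidate["weight_decay"])
```

Keeping the ambiguous values as tuples makes the uncertainty explicit: a reproduction would need the paper itself (or the authors) to narrow the four candidates down to the settings actually used.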