Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Authors: Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-chuan Toh, Pan Zhou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%, outperforming current speculative decoding state-of-the-art methods. Our code and GRIFFIN's draft models will be released publicly in https://github.com/hsj576/GRIFFIN. The paper includes a dedicated "5 Experiments" section with subsections like "5.1 Comparison with SoTAs" and "5.2 Ablation Study," along with tables and figures presenting empirical results.
Researcher Affiliation | Academia | 1 Fudan University; 2 National University of Singapore; 3 Singapore Management University. The affiliations listed are Fudan University, National University of Singapore, and Singapore Management University, all of which are academic institutions.
Pseudocode | No | The paper describes methods using mathematical equations and textual explanations but does not include any clearly labeled pseudocode or algorithm blocks. For example, Section 4.1 describes "Token-Alignable Training" with equations (1) to (5), and Section 4.2 describes the "Token-Alignable Draft Model" with equations (6) to (8), but these are not formatted as algorithms.
Open Source Code | Yes | Our code and GRIFFIN's draft models will be released publicly in https://github.com/hsj576/GRIFFIN. The code is included in https://github.com/hsj576/GRIFFIN, along with detailed guidelines for reproducing our experimental results.
Open Datasets | Yes | We follow priors and train our draft model on ShareGPT dataset, with token-alignment set to top-k (k = 3). We assess performance on three key tasks: multi-turn conversation (MT-Bench [4]), code generation (HumanEval [5]), and mathematical reasoning (GSM8K [14]).
Dataset Splits | No | The paper mentions using ShareGPT for training and MT-Bench, HumanEval, and GSM8K for evaluation. While these are standard datasets/benchmarks, the paper does not explicitly provide details on how these datasets were split into training, validation, or test sets by the authors for reproduction purposes (e.g., percentages, sample counts, or specific split files), nor does it specify if custom splits were generated. It relies on the inherent structure of these benchmarks for evaluation and uses ShareGPT for training without further splitting details.
Hardware Specification | Yes | For consistency, all inference runs use one NVIDIA A100 80G GPU, except for LLaMA3-70B and Mixtral-8x7B, which require two GPUs. Using the LLaMA3-8B-Instruct model on an A100-80G GPU as a representative setup, we measure the forward-pass latency of the target model as approximately t = 25 ms and of the draft model as t = 1.5 ms.
Software Dependencies | No | The paper mentions using the AdamW optimizer and integrating with the open-source vLLM framework, but it does not specify any version numbers for these or other software components (e.g., Python, PyTorch, CUDA, vLLM itself). For example, it states: "The draft model is trained using the AdamW optimizer" in Appendix C.3.
Experiment Setup | Yes | The draft model is trained using the AdamW optimizer, with the following key settings: Learning rate: 3e-5; Batch size: 4 (per GPU); Number of epochs: 20; Total training steps: 800,000; Warmup: 2,000 steps of linear warmup; learning rate scheduler enabled; Optimizer: AdamW, with betas (0.9, 0.95); Gradient clipping: 0.5 (by value); Maximum sequence length: 2,048 tokens.
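
The training settings quoted under Experiment Setup can be gathered into a small configuration sketch. The block below is illustrative only and is not taken from the GRIFFIN repository: the names `DraftTrainingConfig`, `lr_at_step`, and `clip_by_value` are hypothetical, the warmup is assumed to ramp linearly and reach the peak rate on the last warmup step, and the rate is assumed to stay constant afterwards because the source says only that a scheduler is enabled without describing its shape.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DraftTrainingConfig:
    """Values as quoted in the Experiment Setup row; field names are hypothetical."""
    learning_rate: float = 3e-5
    batch_size_per_gpu: int = 4
    num_epochs: int = 20
    total_steps: int = 800_000
    warmup_steps: int = 2_000
    adam_betas: Tuple[float, float] = (0.9, 0.95)
    grad_clip_value: float = 0.5
    max_seq_len: int = 2_048


def lr_at_step(step: int, cfg: DraftTrainingConfig) -> float:
    """Learning rate at a zero-based training step.

    Assumption: linear warmup that reaches the peak rate on the final
    warmup step, then a constant rate. The post-warmup schedule is not
    specified in the source.
    """
    if step < cfg.warmup_steps:
        return cfg.learning_rate * ((step + 1) / cfg.warmup_steps)
    return cfg.learning_rate


def clip_by_value(grads: List[float], limit: float) -> List[float]:
    """Clamp each gradient entry to [-limit, limit] (clipping by value, not by norm)."""
    return [max(-limit, min(limit, g)) for g in grads]


cfg = DraftTrainingConfig()
halfway = lr_at_step(cfg.warmup_steps // 2 - 1, cfg)   # half of the peak rate
after_warmup = lr_at_step(cfg.warmup_steps, cfg)       # peak rate
clipped = clip_by_value([1.0, -2.0, 0.25], cfg.grad_clip_value)
```

Clipping by value bounds each gradient entry independently, which differs from norm-based clipping that rescales the whole gradient vector. In a PyTorch training loop the same settings would typically map onto `torch.optim.AdamW` and `torch.nn.utils.clip_grad_value_`, although the scorecard does not confirm which framework calls the authors used.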