Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Traversal Verification for Speculative Tree Decoding

Authors: Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on Llama3 [10] series and Llama2 [29] using various tree structures. The experiments were performed on the Spec-Bench dataset [32], which encompasses six different tasks: multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation. Experimental results demonstrate that Traversal Verification consistently outperforms existing decoding methods by 2.2%-5.7% in average acceptance length across diverse tasks with different tree architectures.
Researcher Affiliation | Collaboration | Yepeng Weng (1), Qiao Hu (2), Xujie Chen (1), Li Liu (1), Dianwen Mei (1), Huishi Qiu (1), Jiang Tian (1), Zhongchao Shi (1). (1) Lenovo AI Technology Center, Lenovo; (2) National Center for Mathematics and Interdisciplinary Sciences (NCMIS), AMSS, CAS
Pseudocode | Yes | Algorithm 1: Single-token verification; Algorithm 2: Recursive Rejection Sampling; Algorithm 3: Traversal Verification
Open Source Code | No | The entire codebase is proprietary due to our company policy, but maybe we are able to release a portion of it in the future if permitted.
Open Datasets | Yes | We perform experiments on the Spec-Bench dataset [32], which includes 80 instances from each of six distinct domains: multi-turn conversation, translation (WMT14 DE-EN [1]), summarization (CNN/Daily Mail [24]), question answering (Natural Questions [18]), Mathematical reasoning (GSM8K [5]), retrieval-augmented generation (DPR [16]).
Dataset Splits | Yes | We perform experiments on the Spec-Bench dataset [32], which includes 80 instances from each of six distinct domains: multi-turn conversation, translation, summarization, question answering, mathematical reasoning, and retrieval-augmented generation.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA RTX A6000 GPU with PyTorch backend.
Software Dependencies | No | All experiments are conducted on a single NVIDIA RTX A6000 GPU with PyTorch backend. For token-level tree verification, we adopt the RRSw implementation in EAGLE [21] from Spec-Bench [32] open source repository.
Experiment Setup | Yes | Target LLMs and draft model. We mainly conduct experiments on the Llama3 [10] series, using Llama3.2-1B-Instruct as the draft model and Llama3.1-8B-Instruct as the target model. We also include Llama-68M [23] with Llama2-7b [29] as the draft and target model, which is widely adopted in existing speculative decoding researches [4, 12, 13, 26]... For chain and binary tree, we set the depth at 5, which is equal to the maximum depth of EAGLE sparse tree... For chain decoding, we conduct experiments at depths of 2, 4, 6, and 8. For tree decoding, we employ binary trees from depths of 2 to 5 (corresponding to trees with 2^3-1, 2^4-1, 2^5-1, and 2^6-1 nodes, respectively).
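The node counts quoted in the Experiment Setup row follow from the size of a full binary tree. The sketch below reproduces that arithmetic only; it assumes the root sits at depth 0 and is counted as a node, and the function name is illustrative rather than taken from the paper or from Spec-Bench.

```python
def full_binary_tree_nodes(depth: int) -> int:
    """Count nodes in a full binary tree whose root is at depth 0.

    Assumption: the root is counted, so level k contributes 2**k nodes
    and the total over levels 0..depth is 2**(depth + 1) - 1.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # Sum the levels explicitly, then check against the closed form.
    total = sum(2 ** level for level in range(depth + 1))
    assert total == 2 ** (depth + 1) - 1
    return total


# Depths used for binary-tree decoding in the quoted setup.
node_counts = {depth: full_binary_tree_nodes(depth) for depth in range(2, 6)}
print(node_counts)
```

Under this convention, depths 2 through 5 give 7, 15, 31, and 63 nodes, matching the exponents 3 through 6 in the quoted text.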