Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
STree: Speculative Tree Decoding for Hybrid State Space Models
Authors: Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code can be find at: https://github.com/wyc1997/stree. ... Our contributions in this paper are to (i) propose what, to the best of our knowledge, is the first scalable method to leverage tree decoding in the speculative decoding for both SSMs and hybrid architectures; we also (ii) provide a simplified analysis of the trade-off between acceptance length and model runtime to help determine whether we should scale tree size or even use tree decoding. Finally, we (iii) empirically demonstrate that with a baseline drafting model and static tree structure, there are already improvements in generation speed, thus opening the door to further investigation of more advanced speculative decoding methods employed with transformers. ... 5 Experimental Results: In this section, we aim to demonstrate the efficiency of STree on speculative decoding. |
| Researcher Affiliation | Academia | Yangchao Wu1 Zongyue Qin1 Alex Wong2 Stefano Soatto1 1UCLA 2Yale University Correspondence to EMAIL |
| Pseudocode | Yes | Algorithm 1: Speculative decoding with Tree Scan for SSMs. 1: function SPECULATIVEDECODINGWITHTREESCAN 2: Initialize L : mask to indicate last accepted token 3: Initialize Cache : activation cache to facilitate recomputation of state 4: Initialize x : the correct state 5: while should_continue do 6: L_{i:j}, t_{i:j} ← DRAFT(t_{i-1}) ▷ Draft a tree with last accepted token 7: x ← ACTIVATIONREPLAY(L , Cache) ▷ Recompute state up to the rejected tokens 8: t _{i:j}, Cache ← TREESCAN(L, t_{i:j}, x ) ▷ Getting output and cache from target model 9: L , t_{i:k} ← FIRSTREJECTED(t _{i:j}, t_{i:j}) ▷ Accept/Reject Drafted tokens 10: end while 11: end function |
| Open Source Code | Yes | Code can be find at: https://github.com/wyc1997/stree. |
| Open Datasets | Yes | We perform this experiment on the MT_Bench [28] benchmarks for generating 100 tokens. ... We evaluate the speed of generating 1024 tokens on three different benchmarks: MT_Bench [28], Human Eval [3], and GSM8K [5], with two different temperatures: 0 (greedy) and 1. |
| Dataset Splits | No | The paper mentions evaluating generation speed on benchmarks (MT_Bench, Human Eval, GSM8K) for a fixed number of tokens (100 or 1024), but it does not specify any training/test/validation splits used by the authors for these benchmarks or for the models that were trained. |
| Hardware Specification | Yes | All experiments are run on an Nvidia RTX 3090 GPU. ... M=4/5 with N=16 for FSS ran Out Of Memory (OOM) on a 3090 GPU with 24GB of GPU memory. ... We further extend our method to H100 GPUs to demonstrate the our method is still applicable with better hardware. |
| Software Dependencies | No | The paper mentions models such as Mamba2-2.7B and MambaInLlama-8B, and algorithms such as Fused Selective Scan (FSS) and Chunk Scan. However, it does not give version numbers for any underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We measure the runtime of a forward pass through a Mamba2-2.7B model. For the input token tree, we use full binary trees of 4/5/6 layers deep, which contain 15/31/63 tokens in the tree respectively. ... We used a Mamba2-2.7B model as the target model and a Mamba2-130M model as the drafting model. We generate the token tree using beam search [11, 23], where we perform beam search with the drafting model and keep all the tokens generated at each step of beam search, even if the beam is later discarded. This results in an N-layer tree with M tokens at each layer, where N is the number of beam search steps and M is the number of beams. We verify the target model with greedy search, where at each step, the token with the maximum conditional likelihood from the target model is compared to the corresponding child nodes in the token tree to see if we should accept the draft. ... We evaluate the speed of generating 1024 tokens on three different benchmarks: MT_Bench [28], Human Eval [3], and GSM8K [5], with two different temperatures: 0 (greedy) and 1. ... For the vanilla speculative decoding baseline, we use the 2-layer model as the drafting model to draft 1 sequence of 4 tokens every step (target input size 1 × 5) and verify with the target model output using speculative sampling algorithm [? ]. For STree, we use the draft model to draft a static tree structure shown in Fig. 4a, and use multi-step speculative sampling (MSS sampling) [19] to verify with the target model output. Both methods use activation replay to backtrack the state. ... Ablation studies are performed on temperature, sampling algorithm, static tree structure (e.g., Max tree width, Tree depth, Number of tokens are specified in Table 4), percentage of transformer blocks, and different drafting model checkpoints (e.g., 12000 step, 48000 step, 264000 step). |
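
The greedy verification quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, `greedy_verify`, and the `target_argmax` callable are hypothetical names introduced here, and the callable stands in for the target model, which in practice would score every tree node in a single forward pass. Emitting the target's own token at the first mismatch is an assumption based on common speculative decoding practice, not something the quoted text states. The helper `full_binary_tree_size` only checks the quoted 15/31/63 node counts for full binary trees of 4/5/6 layers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One drafted token in the token tree (hypothetical structure)."""
    token: int
    children: list["Node"] = field(default_factory=list)


def full_binary_tree_size(depth: int) -> int:
    """Number of nodes in a full binary tree with `depth` layers."""
    return 2 ** depth - 1


def greedy_verify(root_children, target_argmax):
    """Walk the draft tree, accepting drafted tokens that match the target.

    root_children: drafted candidates for the next position.
    target_argmax: callable mapping the accepted prefix (a list of tokens)
        to the target model's most likely next token (a stand-in for a
        real model call).
    Returns the accepted tokens followed by the target's own token at the
    first position where no drafted child matches (an assumption here).
    """
    accepted = []
    candidates = root_children
    while True:
        want = target_argmax(accepted)
        # Look for a drafted child that agrees with the target's greedy choice.
        match = next((c for c in candidates if c.token == want), None)
        if match is None:
            # No drafted child matches (or we ran past a leaf): stop here.
            accepted.append(want)
            return accepted
        accepted.append(match.token)
        candidates = match.children
```

With a small tree whose first layer holds tokens 1 and 4, and whose node 1 has children 2 and 3, a target that prefers 1, then 3, then 7 would accept the path 1, 3 and then emit 7 itself, since node 3 has no drafted children.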