Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64.
Researcher Affiliation | Collaboration | (1) University of Waterloo, (2) Peking University, (3) Microsoft Research, (4) Vector Institute
Pseudocode | No | The paper describes methods using diagrams (Figure 3, Figure 5, Figure 6) and text, but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/SafeAILab/EAGLE.
Open Datasets | Yes | For multi-turn conversation, code generation, mathematical reasoning, instruction following, and summarization we chose the MT-bench [19], HumanEval [20], GSM8K [21], Alpaca [22], and CNN/Daily Mail [23] datasets, respectively. ... We use ShareGPT and UltraChat-200K [24] as training data, containing approximately 68K and 464K data entries, respectively. ... For the reasoning model DeepSeek-R1-Distill-LLaMA 8B, we also use the OpenThoughts-114k-math dataset for training.
Dataset Splits | Yes | Following EAGLE [2] and Spec-Bench [18], we evaluate on five tasks, using the same weights for all tasks without fine-tuning on the respective tasks.
Hardware Specification | Yes | If not specified, we use the A100 GPU to test 70B models and the RTX 3090 for other models. The performance of EAGLE-3 for large batches on a single H100 GPU and LLaMA-Instruct 3.1 8B in the SGLang v0.4.4 environment [9] was evaluated in Table 3. We also tested the throughput of EAGLE-3 at batch size = 1 on H100 when the target model is LLaMA-Instruct 3.1 8B and the testing dataset is MT-bench. The results are shown in Table 4. ... and the results on RTX 3090 and LLaMA-Instruct 3.1 8B are shown in Table 5.
Software Dependencies | Yes | The performance of EAGLE-3 for large batches on a single H100 GPU and LLaMA-Instruct 3.1 8B in the SGLang v0.4.4 environment [9] was evaluated in Table 3. ... We also conducted a study on the impact of EAGLE-3 on throughput for large batch sizes based on vLLM [29], a widely used production-grade framework, and the results on RTX 3090 and LLaMA-Instruct 3.1 8B are shown in Table 5.
Experiment Setup | Yes | We use the AdamW optimizer, with beta values (β1, β2) set to (0.9, 0.95) and implemented gradient clipping of 0.5. The learning rate is set to 5e-5. We simulate 5 steps during training-time test. ... This part of the experiment did not use the tree structure, the chain length was set to 3 ... This part of the experiment did not use the tree structure, the maximum chain length was set to 2
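
To make the quoted Experiment Setup values concrete, the sketch below collects them in a small configuration object and applies one AdamW update with gradient clipping in plain Python. It is an illustration only, not the authors' training code: the names `TrainConfig`, `clip_by_global_norm`, and `adamw_step` are hypothetical, the clipping is assumed to be global L2-norm clipping (the quoted text does not say which kind), and `weight_decay` and `eps` are assumed values that the quoted text does not report. The `ttt_steps` field only records the "5 steps" figure; the training-time test procedure itself is not implemented here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup row.
    lr: float = 5e-5
    beta1: float = 0.9
    beta2: float = 0.95
    grad_clip: float = 0.5
    ttt_steps: int = 5  # simulated steps during training-time test (recorded only)
    # Assumptions: not stated in the quoted text.
    weight_decay: float = 0.0
    eps: float = 1e-8


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their joint L2 norm is at most max_norm (assumed clipping mode)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def adamw_step(params, grads, m, v, t, cfg):
    """One AdamW update on flat lists of floats; t is the 1-based step index."""
    grads = clip_by_global_norm(grads, cfg.grad_clip)
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = cfg.beta1 * m_i + (1 - cfg.beta1) * g
        v_i = cfg.beta2 * v_i + (1 - cfg.beta2) * g * g
        # Bias correction for the zero-initialised averages.
        m_hat = m_i / (1 - cfg.beta1 ** t)
        v_hat = v_i / (1 - cfg.beta2 ** t)
        # Decoupled weight decay, then the Adam step.
        p = p - cfg.lr * cfg.weight_decay * p
        p = p - cfg.lr * m_hat / (math.sqrt(v_hat) + cfg.eps)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v
```

On the first step the bias-corrected update reduces to roughly the learning rate times the sign of each clipped gradient, so with these settings each parameter moves by about 5e-5 regardless of the raw gradient scale.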