EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following.
Researcher Affiliation | Collaboration | Peking University; University of Waterloo; Microsoft Research; Vector Institute.
Pseudocode | Yes | Algorithm 1: Multi-round speculative sampling (a hedged sketch of the acceptance rule it builds on follows the table).
Open Source Code | Yes | The code is available at https://github.com/SafeAILab/EAGLE.
Open Datasets | Yes | We evaluated EAGLE across multiple tasks including multi-turn dialogue, code generation, mathematical reasoning, and instruction following, employing the MT-bench (Zheng et al., 2023), HumanEval (Chen et al., 2021), GSM8K (Cobbe et al., 2021), and Alpaca (Taori et al., 2023) datasets, respectively. ... EAGLE was trained on the ShareGPT dataset, utilizing 68,000 dialogue iterations with a learning rate set at 3e-5.
Dataset Splits | No | The paper mentions using specific datasets for evaluation (MT-bench, HumanEval, GSM8K, Alpaca) and for training (ShareGPT), but it does not provide explicit train/validation/test splits (e.g., percentages or counts) for reproduction purposes. These datasets are typically used as test/evaluation sets.
Hardware Specification | Yes | For example, with gpt-fast (PyTorch Labs, 2023), EAGLE accelerates LLaMA2-Chat 7B decoding to 160.4 tokens/s on a single RTX 3090 GPU. ... The training is completed in 1-2 days on 4x A100 (40G) GPUs. ... For Vicuna 7B as the target LLM, operating under a memory constraint of a single RTX 3090 with 24G of CUDA memory... In the case of LLaMA2-Chat 70B, constrained by 4 A100 (40G) GPUs totaling 160G of CUDA memory...
Software Dependencies | No | The paper mentions 'gpt-fast (PyTorch Labs, 2023)' as a tool used in combination with EAGLE, but it does not specify version numbers for Python, PyTorch, CUDA, or other key software components used in the experimental setup.
Experiment Setup | Yes | By integrating regression loss and classification loss, we train the Autoregression Head using the combined loss function L = L_reg + w_cls · L_cls. Typically, the classification loss is an order of magnitude larger than the regression loss in numerical terms. Consequently, we set w_cls to 0.1. ... We employed data augmentation by adding random noise sampled from a uniform distribution U(−0.1, 0.1) to features of the target LLM during training (Jain et al., 2023). ... EAGLE was trained on the ShareGPT dataset, utilizing 68,000 dialogue iterations with a learning rate set at 3e-5. We employed the AdamW optimizer with beta values (β1, β2) set to (0.9, 0.95) and implemented gradient clipping of 0.5. ... All evaluations were conducted at FP16 precision. (A hedged training-step sketch also follows the table.)
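
Algorithm 1 in the paper (Multi-round speculative sampling) extends the standard speculative sampling acceptance rule to multiple candidate tokens per position. The sketch below shows only that underlying single-draft accept/resample step as a minimal Python illustration; the function name, the NumPy usage, and the probability-vector interface are assumptions, not the paper's implementation.

```python
# Minimal sketch of the single-draft speculative sampling acceptance rule that
# multi-round variants such as Algorithm 1 build on. Names and the use of
# NumPy are illustrative assumptions, not the paper's code.
import numpy as np

def speculative_accept(draft_token: int,
                       p_target: np.ndarray,
                       q_draft: np.ndarray,
                       rng: np.random.Generator) -> int:
    """Accept or resample a token proposed by the draft model.

    p_target, q_draft: probability vectors over the vocabulary from the
    target LLM and the draft model at the same position. The returned token
    is distributed exactly according to p_target.
    """
    # Accept the draft token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p_target[draft_token] / max(q_draft[draft_token], 1e-12))
    if rng.random() < accept_prob:
        return draft_token
    # On rejection, resample from the residual distribution max(0, p - q),
    # renormalized; this keeps the output distribution unchanged.
    residual = np.clip(p_target - q_draft, 0.0, None)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual))
```

The multi-round variant in Algorithm 1 applies this accept/resample logic repeatedly across several drafted candidates per position rather than a single draft token.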
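
The training details quoted in the Experiment Setup row (combined loss L = L_reg + w_cls · L_cls with w_cls = 0.1, AdamW with betas (0.9, 0.95), learning rate 3e-5, gradient clipping at 0.5, and U(−0.1, 0.1) noise added to the target-LLM features) can be collected into a single training step. The sketch below is a hedged illustration under those settings; the DraftHead module, the hidden/vocabulary sizes, and the use of Smooth L1 for the regression term are placeholders and assumptions, not EAGLE's released code.

```python
# Hedged sketch of the quoted training configuration, not EAGLE's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, VOCAB = 4096, 32000  # illustrative sizes, not taken from the paper

class DraftHead(nn.Module):
    """Placeholder one-layer head standing in for EAGLE's Autoregression Head."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN, HIDDEN)    # predicts the next feature vector
        self.lm_head = nn.Linear(HIDDEN, VOCAB)  # maps features to token logits

    def forward(self, feats):
        pred = self.proj(feats)
        return pred, self.lm_head(pred)

draft_head = DraftHead()
w_cls = 0.1  # classification loss is ~10x the regression loss, so it is down-weighted
optimizer = torch.optim.AdamW(draft_head.parameters(), lr=3e-5, betas=(0.9, 0.95))

def train_step(features, next_features, next_token):
    # Data augmentation: add uniform noise U(-0.1, 0.1) to the target-LLM features.
    noisy = features + (torch.rand_like(features) * 0.2 - 0.1)
    pred_features, logits = draft_head(noisy)
    # Combined loss L = L_reg + w_cls * L_cls (Smooth L1 for L_reg is an assumption).
    loss = (F.smooth_l1_loss(pred_features, next_features)
            + w_cls * F.cross_entropy(logits, next_token))
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(draft_head.parameters(), 0.5)  # clipping at 0.5
    optimizer.step()
    return loss.item()
```

A call such as train_step(features, next_features, next_token) with tensors of shape (batch, HIDDEN), (batch, HIDDEN), and (batch,) would run one optimizer step under the quoted hyperparameters.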