Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Authors: Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our implementation of LOOKAHEAD DECODING can speed up autoregressive decoding by up to 1.8x on MT-Bench and 4x with strong scaling on multiple GPUs in code completion tasks. We evaluated LOOKAHEAD DECODING on the popular LLaMA-2 (Touvron et al., 2023b) models. It achieves 1.8x speedup on the challenging multi-turn chat dataset MT-Bench (Zheng et al., 2024) and up to 4x speedup in code completion tasks with Lookahead Parallelism on 8 GPUs. We used various versions of the LLaMA-2 (Touvron et al., 2023b) and Code Llama (Roziere et al., 2023) models, including the 7B, 13B, 34B, and 70B sizes, on two GPU setups S1 and S2. Datasets: We benchmarked LOOKAHEAD DECODING's performance across a broad spectrum of datasets and tasks. MT-Bench (Zheng et al., 2024) is a diverse set of multi-turn questions with many unique tokens. GSM8K (Cobbe et al., 2021) contains a set of math questions, of which we use the first 1k. HumanEval (Chen et al., 2021) covers both code completion and infilling tasks. We also test on the MBPP (Austin et al., 2021) dataset for instruction-based code generation, and on ClassEval (Du et al., 2023) for class-level code completion. Tab. 1 lists detailed settings. Fig. 5 shows the end-to-end performance of LOOKAHEAD DECODING compared with Hugging Face's implementation of greedy search on S1. Across various datasets, LOOKAHEAD DECODING demonstrates a 1.4x-2.3x speedup. Table 3: Comparison of the effectiveness of the lookahead and verification branches on MT-Bench on A100, with FlashAttention activated; speedups are reported against autoregressive decoding alongside the compression ratio (S). |
| Researcher Affiliation | Collaboration | Yichao Fu (1), Peter Bailis (2), Ion Stoica (3), Hao Zhang (1); (1) UCSD, (2) Google, (3) UC Berkeley. Work done when Yichao Fu was visiting UCSD. |
| Pseudocode | Yes | Fig. 1 illustrates its workflow, and Algorithm 1 gives the details. Algorithm 1: Lookahead decoding. Algorithm 3: Greedy Verification with LOOKAHEAD DECODING. Algorithm 4: Sample Verification with LOOKAHEAD DECODING. (A simplified sketch of the decoding loop appears after this table.) |
| Open Source Code | Yes | Our code is available at https://github.com/hao-ai-lab/LookaheadDecoding |
| Open Datasets | Yes | MT-Bench (Zheng et al., 2024) is a diverse set of multi-turn questions with many unique tokens. GSM8K (Cobbe et al., 2021) contains a set of math questions, of which we use the first 1k. HumanEval (Chen et al., 2021) covers both code completion and infilling tasks. We also test on the MBPP (Austin et al., 2021) dataset for instruction-based code generation, and on ClassEval (Du et al., 2023) for class-level code completion. In addition, we validate the effectiveness of sampling ( 3.2) on the XSum (Narayan et al., 2018) and CNN/Daily Mail (See et al., 2017) datasets. |
| Dataset Splits | No | The paper describes the datasets used and notes some settings like sequence length, but does not provide specific train/validation/test splits (e.g., percentages or sample counts) needed for reproduction. |
| Hardware Specification | Yes | S1 is equipped with NVIDIA A100 GPUs with 80GB of memory. S2 is a DGX machine with 8 NVIDIA A100 GPUs with 40GB memory and NVLink. Some additional experiments were run on RTX 3090 GPUs. |
| Software Dependencies | Yes | We have implemented the algorithm in Python and CUDA, and it is compatible with memory-efficient attention algorithms (e.g., FlashAttention (Dao, 2023)). Pipeline parallelism is supported via Accelerate (Gugger et al., 2022), and tensor parallelism (TP) via DeepSpeed (Aminabadi et al., 2022). |
| Experiment Setup | Yes | All models serve with FP16 precision and a batch size of 1 if not specified. To control generation length in code generation tasks, we set the maximum sequence length to 512 and 2,048 on HumanEval and ClassEval, respectively. Tab. 1 lists detailed settings. Table 4: Good configurations of LOOKAHEAD DECODING on A100 GPUs with G = W: the 7B model uses window size W = 15 and n-gram size N = 5; 13B uses W = 10, N = 5; 34B uses W = 7, N = 5. (A minimal FP16 serving sketch appears at the end of this section.) |
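
To make the Pseudocode row above concrete, here is a minimal single-sequence sketch of the lookahead/verification loop. This is an illustration I have constructed, not the authors' Algorithm 1: `toy_model` is a stand-in for an LLM's greedy next-token function, the random window seeding and column-wise n-gram harvesting are simplifications, and a real implementation fuses both branches into one batched attention call with a custom mask rather than calling the model token by token.

```python
import random
from collections import defaultdict

def toy_model(context):
    """Stand-in for an LLM's greedy next-token function: a deterministic
    pseudo-random map from the recent context to a token id in [0, 50)."""
    return hash(tuple(context[-4:])) % 50

def lookahead_decode(prompt, n_new, W=5, N=4, vocab=50):
    """Illustrative lookahead decoding loop (simplified, single sequence).

    W = window size (parallel lookahead positions per step),
    N = n-gram size, so N-1 lookahead levels are maintained.
    """
    out = list(prompt)
    # (N-1) x W window of speculative tokens, seeded with arbitrary
    # guesses as in Jacobi decoding.
    window = [[random.randrange(vocab) for _ in range(W)] for _ in range(N - 1)]
    pool = defaultdict(set)  # first token -> candidate n-grams (tuples)

    while len(out) - len(prompt) < n_new:
        # Lookahead branch: one Jacobi-style refinement step. Position j's
        # new guess conditions on the committed output plus column j of
        # the window (that position's speculative trajectory).
        new_level = [toy_model(out + [window[i][j] for i in range(N - 1)])
                     for j in range(W)]
        # Harvest one N-gram per column into the pool.
        for j in range(W):
            gram = tuple(window[i][j] for i in range(N - 1)) + (new_level[j],)
            pool[gram[0]].add(gram)
        # Slide the window: drop the oldest level, append the new one.
        window = window[1:] + [new_level]

        # Verification branch: one token is always committed; candidate
        # n-grams whose first token matches it are then checked greedily.
        # (A real implementation verifies all candidates in one forward pass.)
        next_tok = toy_model(out)
        out.append(next_tok)
        best = []
        for gram in pool.get(next_tok, ()):
            accepted = []
            for tok in gram[1:]:
                if toy_model(out + accepted) == tok:
                    accepted.append(tok)
                else:
                    break
            if len(accepted) > len(best):
                best = accepted
        out.extend(best)  # extra tokens accepted "for free" this step

    return out[len(prompt):][:n_new]

# W=15, N=5 mirrors the Table 4 "good config" for the 7B model.
print(lookahead_decode([1, 2, 3], n_new=32, W=15, N=5))
```

Because the verification branch only accepts tokens that match the greedy continuation, the output is identical to plain greedy decoding; the speedup comes from committing several verified tokens per step.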
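
For the serving setup described in the Experiment Setup row (FP16, batch size 1, greedy search capped at 512 new tokens as in the HumanEval setting), a baseline reproduction sketch using Hugging Face transformers might look like the following. The checkpoint name is an assumption, and `GOOD_CONFIG` merely records the Table 4 values for reference; the authors' repository exposes its own configuration API for enabling lookahead decoding, which is not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Table 4 "good config" (G = W): window size W and n-gram size N per
# model size, recorded here as a plain dict for reference only.
GOOD_CONFIG = {"7B": dict(W=15, N=5), "13B": dict(W=10, N=5), "34B": dict(W=7, N=5)}

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # FP16 serving
)

prompt = "def fibonacci(n):"
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy search baseline (batch of 1), capped at 512 new tokens as in
# the HumanEval setting; lookahead decoding is verified against this.
out = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```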