Fast Best-of-N Decoding via Speculative Rejection
Authors: Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct extensive experiments to demonstrate the effectiveness and efficiency of SPECULATIVE REJECTION. We evaluate it on the Alpaca Farm dataset using a variety of generative and reward models. Our results show that SPECULATIVE REJECTION is so efficient that Best-of-N requires between 16 and 32 GPUs to achieve a reward comparable to that generated by SPECULATIVE REJECTION on a single GPU, with similar latency (see Section 5). |
| Researcher Affiliation | Collaboration | Hanshi Sun¹, Momin Haider², Ruiqi Zhang³, Huitao Yang⁵, Jiahao Qiu⁴, Ming Yin⁴, Mengdi Wang⁴, Peter L. Bartlett³﹐⁶, Andrea Zanette¹; ¹Carnegie Mellon University, ²University of Virginia, ³UC Berkeley, ⁴Princeton University, ⁵Fudan University, ⁶Google DeepMind |
| Pseudocode | Yes | Algorithm 1 SPECULATIVE REJECTION |
| Open Source Code | Yes | The code is available at https://github.com/Zanette-Labs/SpeculativeRejection. |
| Open Datasets | Yes | We evaluate it on the Alpaca Farm dataset using a variety of generative and reward models. ... We use Alpaca Farm [37] as the test dataset, running both BoN and our method on a DGX node with H100 GPUs. [37] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, May 2023. |
| Dataset Splits | Yes | We use Alpaca Farm [37] as the test dataset... We randomly sample 100 prompts in the Alpaca Farm-Eval dataset. |
| Hardware Specification | Yes | We use Alpaca Farm [37] as the test dataset, running both BoN and our method on a DGX node with H100 GPUs. |
| Software Dependencies | No | Our implementation, based on PyTorch, features an efficient inference system... We utilize the standard generate() function in Hugging Face transformers [68]... However, specific version numbers for PyTorch, Hugging Face transformers, or other dependencies are not provided. |
| Experiment Setup | Yes | We set Best-of-120 as the baseline because it can run on a single 80GB GPU, producing all utterances concurrently without running out of memory. Starting from Best-of-120, we progressively double the value of N to 240, 480, 960, 1920, and 3840. ... For SPECULATIVE REJECTION, we additionally report the rejection rate α. |
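To make the quoted setup concrete, the sketch below illustrates the core idea behind Speculative Rejection as described above: generate all N candidate utterances concurrently, periodically score the partial generations with a reward model, and reject the bottom α fraction so that compute and memory concentrate on promising candidates. This is a minimal toy sketch, not the paper's implementation; the function names, the `chunk_size` parameter, the fixed per-round rejection schedule, and the character-level "tokens" in the demo are all assumptions for illustration, and the paper's Algorithm 1 may differ in its stopping criterion and rejection schedule.

```python
import random


def speculative_rejection(prompt, n, alpha, chunk_size, max_len,
                          generate_chunk, reward):
    """Toy sketch of Best-of-N with speculative rejection.

    generate_chunk(text, k): extends a partial utterance by k "tokens".
    reward(text): scores a (partial or complete) utterance.
    alpha: fraction of candidates rejected at each scoring round.
    """
    candidates = [prompt] * n
    length = 0
    while length < max_len and len(candidates) > 1:
        # Extend every surviving candidate by one chunk (done in one
        # batched forward pass in a real system).
        candidates = [generate_chunk(c, chunk_size) for c in candidates]
        length += chunk_size
        # Score partial utterances; keep the top (1 - alpha) fraction.
        scored = sorted(candidates, key=reward, reverse=True)
        keep = max(1, int(len(scored) * (1 - alpha)))
        candidates = scored[:keep]
    # Final selection among survivors, as in plain Best-of-N.
    return max(candidates, key=reward)


# Toy demo: single characters stand in for tokens and the "reward
# model" simply counts occurrences of 'a'.
_rng = random.Random(0)
best = speculative_rejection(
    "x", n=4, alpha=0.5, chunk_size=4, max_len=8,
    generate_chunk=lambda text, k: text + "".join(
        _rng.choice("ab") for _ in range(k)),
    reward=lambda text: text.count("a"),
)
```

With α = 0.5 the candidate pool halves at every scoring round (4 → 2 → 1 here), which is why a single GPU can start from a large N without holding all full-length generations in memory at once.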