Fast Best-of-N Decoding via Speculative Rejection

Authors: Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, Andrea Zanette

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirically, we conduct extensive experiments to demonstrate the effectiveness and efficiency of SPECULATIVE REJECTION. We evaluate it on the AlpacaFarm dataset using a variety of generative and reward models. Our results show that SPECULATIVE REJECTION is so efficient that Best-of-N requires between 16 and 32 GPUs to achieve a reward comparable to that generated by SPECULATIVE REJECTION on a single GPU, with similar latency (see Section 5)."
Researcher Affiliation | Collaboration | Hanshi Sun (1), Momin Haider (2), Ruiqi Zhang (3), Huitao Yang (5), Jiahao Qiu (4), Ming Yin (4), Mengdi Wang (4), Peter L. Bartlett (3, 6), Andrea Zanette (1). Affiliations: 1 Carnegie Mellon University, 2 University of Virginia, 3 UC Berkeley, 4 Princeton University, 5 Fudan University, 6 Google DeepMind.
Pseudocode | Yes | "Algorithm 1: SPECULATIVE REJECTION"
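The paper's Algorithm 1 interleaves Best-of-N generation with early rejection: all N candidates are decoded in parallel, and at intermediate checkpoints the reward model scores the partial utterances so the lowest-scoring fraction can be dropped. A minimal toy sketch of that loop, assuming hypothetical `generate_chunk` and `reward` stand-ins for the generative and reward models (not the paper's actual implementation):

```python
import random

def speculative_rejection(prompt, n, generate_chunk, reward,
                          rejection_rate=0.5, num_rounds=4):
    """Toy sketch of speculative rejection: run Best-of-N, but at each
    checkpoint score the partial generations with the reward model and
    reject the lowest-scoring fraction, concentrating compute and memory
    on promising candidates."""
    candidates = [prompt] * n
    for _ in range(num_rounds):
        # Extend every surviving candidate by one chunk of tokens.
        candidates = [c + generate_chunk(c) for c in candidates]
        if len(candidates) > 1:
            # Score partial utterances and keep the top (1 - rejection_rate).
            candidates.sort(key=reward, reverse=True)
            keep = max(1, int(len(candidates) * (1 - rejection_rate)))
            candidates = candidates[:keep]
    # Return the highest-reward completed utterance among the survivors.
    return max(candidates, key=reward)

# Toy usage: "tokens" are characters, reward counts occurrences of 'a'.
random.seed(0)
best = speculative_rejection("Q:", n=16,
                             generate_chunk=lambda c: random.choice("ab"),
                             reward=lambda s: s.count("a"))
```

With a rejection rate of 0.5, each checkpoint halves the surviving population, which is what lets one GPU mimic a much larger Best-of-N run: most of the N candidates never pay the cost of full-length decoding.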
Open Source Code | Yes | "The code is available at https://github.com/Zanette-Labs/SpeculativeRejection."
Open Datasets | Yes | "We evaluate it on the AlpacaFarm dataset using a variety of generative and reward models. ... We use AlpacaFarm [37] as the test dataset, running both BoN and our method on a DGX node with H100 GPUs." [37] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, May 2023.
Dataset Splits | Yes | "We use AlpacaFarm [37] as the test dataset... We randomly sample 100 prompts in the AlpacaFarm-Eval dataset."
Hardware Specification | Yes | "We use AlpacaFarm [37] as the test dataset, running both BoN and our method on a DGX node with H100 GPUs."
Software Dependencies | No | "Our implementation, based on PyTorch, features an efficient inference system... We utilize the standard generate() function in Hugging Face transformers [68]..." However, specific version numbers for PyTorch, Hugging Face transformers, or other dependencies are not provided.
Experiment Setup | Yes | "We set Best-of-120 as the baseline because it can run on a single 80GB GPU, producing all utterances concurrently without running out of memory. Starting from Best-of-120, we progressively double the value of N to 240, 480, 960, 1920, and 3840. ... For SPECULATIVE REJECTION, we additionally report the rejection rate α."
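The N values in the sweep form a geometric ladder starting from the single-GPU baseline of 120; a one-liner reproduces the schedule quoted above:

```python
# Best-of-N sweep: double N starting from the Best-of-120 baseline.
ladder = [120 * 2 ** k for k in range(6)]
print(ladder)  # [120, 240, 480, 960, 1920, 3840]
```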