Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Authors: Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate NEST and other baselines on various tasks, including text completion, question answering, fact verification, and multiple-choice tasks, providing a comprehensive picture of the factuality, fluency, and attribution of NEST in different domains.
Researcher Affiliation | Collaboration | ¹Cohere, ²Meta FAIR, ³University of Chicago, ⁴Carnegie Mellon University, ⁵University of Waterloo
Pseudocode | Yes | We provide the complete procedure in Algorithm 1.
Open Source Code | Yes | Code will be released at https://github.com/facebookresearch/NEST/tree/main.
Open Datasets | Yes | WikiText-103 (Merity et al., 2017) is a standard benchmark for language modeling, extracted from the set of verified articles on Wikipedia. Pile of Law (Henderson et al., 2022) is a growing dataset of legal and administrative data. Wikipedia (CC BY-SA 3.0): for all tasks except text completion on Pile of Law, we use the Wikipedia 2021 dump released by Izacard et al. (2024) as the knowledge source and follow the same pre-processing procedures as RA-DIT (Lin et al., 2024), yielding 33M passages of fewer than 200 tokens each.
Dataset Splits | Yes | We use the datasets from Hugging Face and further split the test data into validation and test sets. Hyper-parameters of all baselines and NEST are tuned on the dev sets of WikiText-103, NQ, and Biography.
Hardware Specification | Yes | The latency experiment is done on 8 A100 GPUs (for model parallelization) and 32 CPU threads (for search).
Software Dependencies | No | The paper mentions software like Faiss and Pyserini, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For relative retrieval confidence, we set α = 0.3, τ = 0.1 for all Wikipedia-based tasks and α = 0.2, τ = 0.1 for Pile of Law, for all model sizes, in Equation (4). For dynamic span selection, we set the n-gram length to 64 and δ = 0.5 for all model sizes and all tasks in Equation (6). For relaxed speculative decoding, we set γ = 5e-4 for Pile of Law tasks for all model sizes in Equation (7).
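
The sketches below expand on several rows of the table.

On the Pseudocode row: as a concrete picture of what a NEST-style procedure looks like, here is a minimal, self-contained Python sketch of span drafting with relaxed prefix acceptance. Everything in it (the toy distributions, the fixed mixing weight, the random span retriever) is an illustrative assumption standing in for the paper's components; this is not the authors' Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100  # toy vocabulary size (assumption)

def lm_probs(ctx):
    """Toy stand-in for the base LM's next-token distribution."""
    p = rng.random(VOCAB)
    return p / p.sum()

def knn_probs(ctx):
    """Toy stand-in for the retrieval (kNN) next-token distribution."""
    p = rng.random(VOCAB)
    return p / p.sum()

def retrieve_span(ctx, n=4):
    """Toy stand-in for drafting an n-token span from the datastore."""
    return list(rng.integers(0, VOCAB, size=n))

def mix(p_lm, p_knn, lam=0.3):
    # Interpolation in the spirit of Equation (4); here lam is a fixed
    # constant rather than the paper's relative-retrieval-confidence weight.
    return lam * p_knn + (1 - lam) * p_lm

def nest_like_generate(prompt, max_new=16, gamma=5e-4):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        accepted = []
        for tok in retrieve_span(out):
            p = mix(lm_probs(out + accepted), knn_probs(out + accepted))
            if p[tok] < gamma:   # relaxed acceptance test (Eq. (7)-style)
                break
            accepted.append(tok)
        if not accepted:         # no drafted token survived: sample one
            p = mix(lm_probs(out), knn_probs(out))
            accepted = [int(rng.choice(VOCAB, p=p))]
        out.extend(accepted)
    return out

print(nest_like_generate([1, 2, 3]))
```

The key structural point is that retrieval proposes a multi-token span and the mixture distribution decides how much of that span survives, which is what yields both a decoding speedup and span-level attribution.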
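On the Dataset Splits row: carving a validation set out of a public test split is straightforward with the Hugging Face datasets library. A minimal sketch, assuming the WikiText-103 configuration on the Hub and an arbitrary 50/50 ratio (the paper does not state the ratio it used):

```python
from datasets import load_dataset

# Load a public benchmark from the Hugging Face Hub (WikiText-103 here;
# the specific configuration name is an illustrative assumption).
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# Carve a validation set out of the test data, mirroring the paper's
# "split the test data into validation and test sets".
split = wikitext["test"].train_test_split(test_size=0.5, seed=42)
dev_set, test_set = split["train"], split["test"]
print(len(dev_set), len(test_set))
```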
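On the Hardware Specification row: the CPU-thread count for search is the kind of detail that directly affects latency numbers. In Faiss, which the paper names, the thread count is controlled through its OpenMP setting; a minimal sketch, with the index type and embedding dimensionality as assumptions:

```python
import faiss
import numpy as np

faiss.omp_set_num_threads(32)   # match the 32 CPU search threads

d = 768                          # embedding dimensionality (assumption)
index = faiss.IndexFlatIP(d)     # simple inner-product index (assumption)
index.add(np.random.rand(10000, d).astype("float32"))

queries = np.random.rand(4, d).astype("float32")
scores, ids = index.search(queries, 10)  # top-10 neighbors per query
print(ids.shape)
```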
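On the Software Dependencies row: when a paper omits version numbers, they can at least be captured at run time from the installed environment. A small sketch using the standard importlib.metadata mechanism; the package list is an assumption based on the tools the paper names:

```python
from importlib.metadata import version, PackageNotFoundError

# Record the installed versions of the retrieval stack so that runs
# can be reproduced even when the paper does not report them.
for pkg in ("faiss-cpu", "faiss-gpu", "pyserini", "torch"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```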
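On the Experiment Setup row: to make the roles of τ and α concrete, here is a sketch of the standard kNN-LM-style construction that Equation (4) builds on, in which retrieval distances are softmaxed with temperature τ and the resulting distribution is interpolated with the LM's using weight α. The paper's relative retrieval confidence makes the interpolation weight input-dependent; this sketch does not reproduce that exact form.

```python
import numpy as np

def knn_distribution(distances, neighbor_tokens, vocab_size, tau=0.1):
    """Turn kNN distances into a next-token distribution with temperature
    tau (a standard kNN-LM construction, not the paper's exact Eq. (4))."""
    weights = np.exp(-np.asarray(distances) / tau)
    weights /= weights.sum()
    p = np.zeros(vocab_size)
    for tok, w in zip(neighbor_tokens, weights):
        p[tok] += w
    return p

def interpolate(p_lm, p_knn, alpha=0.3):
    """Fixed-weight interpolation; NEST instead modulates the weight by a
    relative retrieval confidence, which this sketch does not model."""
    return alpha * p_knn + (1 - alpha) * p_lm

# Toy usage: three retrieved neighbors, two of which continue with token 3.
p_lm = np.full(10, 0.1)   # toy uniform LM distribution over a 10-token vocab
p_knn = knn_distribution([0.2, 0.4, 0.9], [3, 3, 7], vocab_size=10)
print(interpolate(p_lm, p_knn))
```

With τ = 0.1, small differences in distance translate into sharply peaked retrieval distributions, which is why the low temperature reported in the table concentrates probability on the closest neighbors.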