Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Authors: Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate NEST and other baselines on various tasks including text completion, question-answering, fact-verification, and multi-choice tasks, providing a comprehensive picture of factuality, fluency, and attribution of NEST in different domains. |
| Researcher Affiliation | Collaboration | Cohere, Meta FAIR, University of Chicago, Carnegie Mellon University, University of Waterloo |
| Pseudocode | Yes | We provide the complete procedure in Algorithm 1. |
| Open Source Code | Yes | Code will be released at https://github.com/facebookresearch/NEST/tree/main. |
| Open Datasets | Yes | WikiText-103 (Merity et al., 2017) is a standard benchmark for language modeling, extracted from the set of verified articles on Wikipedia. Pile of Law (Henderson et al., 2022) is a growing dataset of legal and administrative data. Wikipedia (CC BY-SA 3.0): For tasks except text completion on Pile of Law, we use the Wikipedia 2021 dump released by Izacard et al. (2024) as the knowledge source and follow the same pre-processing procedures in RA-DIT (Lin et al., 2024), yielding 33M passages with each less than 200 tokens. |
| Dataset Splits | Yes | We use the datasets from Huggingface and further split the test data into validation and test sets. Hyper-parameters of all baselines and NEST are tuned on the dev set of WikiText-103, NQ, and Biography. |
| Hardware Specification | Yes | The latency experiment is done on 8 A100 GPUs (for model parallelization) and 32 CPU threads (for search). |
| Software Dependencies | No | The paper mentions software like Faiss and Pyserini, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For relative retrieval confidence, we set α = 0.3, τ = 0.1 for all Wikipedia-based tasks and α = 0.2, τ = 0.1 for Pile of Law for all model sizes in Equation (4). For dynamic span selection, we set the n-gram length to be 64 and δ = 0.5 for all model sizes and all tasks in Equation (6). For relaxed speculative decoding, we set γ = 5e-4 for Pile of Law tasks for all model sizes in Equation (7). (An illustrative sketch of these components follows the table.) |
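The Experiment Setup row quotes the hyper-parameters α, τ, δ, and γ that govern NEST's retrieval-augmented decoding stages. The sketch below is a minimal, illustrative rendering of two of those stages under simplified assumptions: a kNN-style token distribution softened by τ and mixed with the base LM distribution behind an α confidence gate, plus a per-token γ threshold standing in for the relaxed acceptance rule. The function names, the exact form of the confidence gate, and the acceptance criterion are ours, not the paper's Equations (4)-(7).

```python
# Illustrative sketch only; not the paper's exact equations or released code.
import numpy as np

def knn_distribution(distances, neighbor_token_ids, vocab_size, tau=0.1):
    """Turn retrieved-neighbor distances into a token distribution.
    Closer neighbors receive more weight; tau softens the weighting."""
    weights = np.exp(-np.asarray(distances, dtype=float) / tau)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    for w, tok in zip(weights, neighbor_token_ids):
        p_knn[tok] += w
    return p_knn

def mix_with_lm(p_lm, p_knn, confidence, alpha=0.3):
    """Interpolate the LM and kNN distributions, falling back to the LM
    alone when the retrieval confidence is below the alpha threshold
    (a simplified stand-in for the relative retrieval confidence gate)."""
    lam = confidence if confidence >= alpha else 0.0
    return lam * p_knn + (1.0 - lam) * p_lm

def accept_prefix(span_token_ids, mixed_probs_per_step, gamma=5e-4):
    """Hypothetical relaxed-acceptance rule: keep the longest prefix of a
    drafted span whose per-step mixture probability stays above gamma."""
    kept = []
    for tok, probs in zip(span_token_ids, mixed_probs_per_step):
        if probs[tok] < gamma:
            break
        kept.append(tok)
    return kept
```

In this toy rendering, lowering α makes the decoder lean on retrieved neighbors more often, while lowering γ makes the drafted spans easier to accept; the paper tunes these on the dev sets listed in the Dataset Splits row.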