Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
Authors: Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Scott Yih, Victoria Lin
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate NEST and other baselines on various tasks including text completion, question-answering, fact-verification, and multi-choice tasks, providing a comprehensive picture of factuality, fluency, and attribution of NEST in different domains. |
| Researcher Affiliation | Collaboration | Cohere; Meta FAIR; University of Chicago; Carnegie Mellon University; University of Waterloo |
| Pseudocode | Yes | We provide the complete procedure in Algorithm 1. |
| Open Source Code | Yes | Code will be released at https://github.com/facebookresearch/NEST/tree/main. |
| Open Datasets | Yes | WikiText-103 (Merity et al., 2017) is a standard benchmark for language modeling, extracted from the set of verified articles on Wikipedia. Pile of Law (Henderson et al., 2022) is a growing dataset of legal and administrative data. Wikipedia (CC BY-SA 3.0): For tasks except text completion on Pile of Law, we use the Wikipedia 2021 dump released by Izacard et al. (2024) as the knowledge source and follow the same pre-processing procedures in RA-DIT (Lin et al., 2024), yielding 33M passages with each less than 200 tokens. |
| Dataset Splits | Yes | We use the datasets from Huggingface and further split the test data into validation and test sets. Hyper-parameters of all baselines and NEST are tuned on the dev set of WikiText-103, NQ, and Biography. |
| Hardware Specification | Yes | The latency experiment is done on 8 A100 GPUs (for model parallelization) and 32 CPU threads (for search). |
| Software Dependencies | No | The paper mentions software like Faiss and Pyserini, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For relative retrieval confidence, we set α = 0.3, τ = 0.1 for all Wikipedia-based tasks and α = 0.2, τ = 0.1 for Pile of Law for all model sizes in Equation (4). For dynamic span selection, we set the n-gram length to be 64 and δ = 0.5 for all model sizes and all tasks in Equation (6). For relaxed speculative decoding, we set γ = 5e-4 for Pile of Law tasks for all model sizes in Equation (7). |
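
The values quoted in the Experiment Setup row can be gathered into one configuration object, which may help anyone attempting a reproduction. The following is only a sketch: the class `NestHyperparams`, the dictionary `SETTINGS`, the keys `"wikipedia"` and `"pile_of_law"`, and the helper `hyperparams_for` are hypothetical names chosen for illustration and are not taken from the NEST repository. The γ value is read as 5e-4 on the assumption that a minus sign was lost in extraction, and γ for Wikipedia-based tasks is left unset because the quoted text does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NestHyperparams:
    """Hypothetical container; field names are illustrative, not from the NEST code."""

    alpha: float            # relative retrieval confidence, Equation (4)
    tau: float              # relative retrieval confidence, Equation (4)
    ngram_length: int       # dynamic span selection, Equation (6)
    delta: float            # dynamic span selection, Equation (6)
    gamma: Optional[float]  # relaxed speculative decoding, Equation (7)


# Values as quoted in the Experiment Setup row; they apply to all model sizes.
SETTINGS = {
    # gamma is not given for Wikipedia-based tasks in the quoted text.
    "wikipedia": NestHyperparams(
        alpha=0.3, tau=0.1, ngram_length=64, delta=0.5, gamma=None
    ),
    # gamma assumed to be 5e-4 (the extracted text reads "5e 4").
    "pile_of_law": NestHyperparams(
        alpha=0.2, tau=0.1, ngram_length=64, delta=0.5, gamma=5e-4
    ),
}


def hyperparams_for(knowledge_source: str) -> NestHyperparams:
    """Return the quoted settings for a knowledge source key."""
    try:
        return SETTINGS[knowledge_source]
    except KeyError:
        known = ", ".join(sorted(SETTINGS))
        raise ValueError(
            f"unknown knowledge source {knowledge_source!r}; expected one of: {known}"
        ) from None
```

Calling `hyperparams_for("pile_of_law")` returns the Pile of Law settings; the frozen dataclass keeps the quoted values from being modified by accident during a run.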