Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Authors: Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela
NeurIPS 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. |
| Researcher Affiliation | Collaboration | Facebook AI Research; University College London; New York University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to run experiments with RAG has been open-sourced as part of the Hugging Face Transformers Library [66] and can be found at https://github.com/huggingface/transformers/blob/master/examples/rag/. An interactive demo of RAG models can be found at https://huggingface.co/rag/ |
| Open Datasets | Yes | We consider four popular open-domain QA datasets: Natural Questions (NQ) [29], TriviaQA (TQA) [24], WebQuestions (WQ) [3] and CuratedTrec (CT) [2]... We use the MSMARCO NLG task v2.1 [43]... We use the splits from SearchQA [10]... FEVER [56]... We use a single Wikipedia dump for our non-parametric knowledge source. Following Lee et al. [31] and Karpukhin et al. [26], we use the December 2018 dump. |
| Dataset Splits | Yes | We consider k ∈ {5, 10} for training and set k for test time using dev data. |
| Hardware Specification | No | The paper discusses the models and datasets used but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions software components like Hugging Face Transformers Library and FAISS but does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Given a fine-tuning training corpus of input/output pairs $(x_j, y_j)$, we minimize the negative marginal log-likelihood of each target, $\sum_j -\log p(y_j \mid x_j)$, using stochastic gradient descent with Adam [28]... We consider k ∈ {5, 10} for training and set k for test time using dev data. |
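
The training objective quoted in the Experiment Setup row can be illustrated with a minimal sketch. It assumes the sequence-level marginalization over the top-k retrieved documents, p(y|x) ≈ Σ_z p(z|x) · p(y|x, z), which the table itself does not spell out. The function names and the toy probabilities are hypothetical and are not taken from the released code.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def negative_marginal_log_likelihood(retrieval_log_probs, generation_log_probs):
    """Return -log p(y|x) for one example.

    Assumption: p(y|x) = sum over the k retrieved documents z of
    p(z|x) * p(y|x, z). Both arguments are lists of length k holding
    log p(z|x) and log p(y|x, z) respectively.
    """
    if len(retrieval_log_probs) != len(generation_log_probs):
        raise ValueError("expected one generation score per retrieved document")
    joint = [r + g for r, g in zip(retrieval_log_probs, generation_log_probs)]
    return -log_sum_exp(joint)


def corpus_loss(examples):
    """Sum of per-example losses, i.e. sum_j -log p(y_j | x_j)."""
    return sum(negative_marginal_log_likelihood(r, g) for r, g in examples)


if __name__ == "__main__":
    # Toy example with k = 2 documents (hypothetical probabilities).
    retrieval = [math.log(0.5), math.log(0.5)]
    generation = [math.log(0.2), math.log(0.4)]
    print(negative_marginal_log_likelihood(retrieval, generation))
```

In an actual training run this quantity would be computed on tensors and minimized with Adam, as the quoted setup states. The pure-Python version here only shows how the per-document terms combine into one loss value.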