Generate rather than Retrieve: Large Language Models are Strong Context Generators

Authors: Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, Meng Jiang

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue systems.
Researcher Affiliation | Collaboration | (1) University of Notre Dame; (2) Microsoft Cognitive Service Research; (3) University of Southern California
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and generated documents can be found at https://github.com/wyu97/GenRead.
Open Datasets | Yes | Open-domain QA (NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017), and WebQ (Berant et al., 2013)), fact checking (FEVER (Thorne et al., 2018) and FM2 (Eisenschlos et al., 2021)), and open-domain dialogue (WoW (Dinan et al., 2019)).
Dataset Splits | Yes | Table 5 of the paper reports the splits (train / valid / test, with test-label availability): TriviaQA (Joshi et al., 2017), open domain: 78,785 / 8,837 / 11,313, public; TriviaQA, Wikipedia split: test 7,993, public; WebQ (Berant et al., 2013), open domain: 3,478 / 300 / 2,032, public; NQ (Kwiatkowski et al., 2019), open domain: 79,168 / 8,757 / 3,610, public; FEVER (Thorne et al., 2018), KILT challenge: 104,966 / 10,444 / 10,100, hidden; FM2 (Eisenschlos et al., 2021), official split: 10,149 / 1,169 / 1,380, public; WoW (Dinan et al., 2019), KILT challenge: 63,734 / 3,054 / 2,944, hidden. A hedged loading sketch appears after the table.
Hardware Specification | Yes | As reported by Izacard & Grave (2021), the training process requires 64 Tesla V100 32GB GPUs running for around one day. The authors use one A100 for T5-770M with a batch size of 16, and 8 A100s for T5-3B with a per-GPU batch size of 2, for a total batch size of 16.
Software Dependencies | No | The paper mentions models such as T5 and FiD and links to repositories for baseline implementations (BM25, DPR, Contriever), but does not specify software versions (e.g., Python, PyTorch, or particular library versions) needed for reproduction.
Experiment Setup | Yes | AdamW optimizer with 2,000 warm-up steps; dropout probability 0.1; weight decay 0.01. One A100 for T5-770M with a batch size of 16; 8 A100s for T5-3B with a per-GPU batch size of 2, for a total batch size of 16. Learning rates were searched from 5e-6 to 4e-5; 3e-5 to 6e-5 performed best under the T5-3B setting and 5e-5 to 1e-4 under the T5-770M setting. Table 6 of the paper additionally lists the peak learning rate, total batch size, and total training steps per dataset. A hedged training-configuration sketch follows the table.
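
The loading sketch referenced in the Dataset Splits row: a minimal example of pulling one of the listed open-domain QA datasets. It assumes the Hugging Face `datasets` library and its hub copy of NQ-open ("nq_open") as a stand-in; that copy's splits do not exactly match Table 5, which appears to carve the 87,925 original training questions into 79,168 train and 8,757 dev examples.

```python
# Minimal sketch, not the paper's data pipeline: load NQ-open from the Hugging Face hub.
from datasets import load_dataset

nq = load_dataset("nq_open")   # hub splits: train / validation (sizes differ from Table 5)
print(nq)                      # split names and sizes of the hub copy
print(nq["train"][0])          # e.g. {'question': '...', 'answer': ['...']}
```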
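
The training-configuration sketch referenced in the Experiment Setup row: a minimal, hedged reconstruction of the reported fine-tuning settings (AdamW, 2,000 warm-up steps, dropout 0.1, weight decay 0.01, total batch size 16). It assumes PyTorch with Hugging Face Transformers, the public t5-large checkpoint (roughly the T5-770M setting), and a linear warm-up schedule; the placeholder step count and the toy one-example batch are illustrations, not values from the paper.

```python
# Hedged sketch of the reported fine-tuning configuration; not the authors' released code.
from torch.optim import AdamW
from transformers import (T5ForConditionalGeneration, T5TokenizerFast,
                          get_linear_schedule_with_warmup)

# t5-large (~770M parameters) stands in for the paper's T5-770M setting.
model = T5ForConditionalGeneration.from_pretrained("t5-large", dropout_rate=0.1)
tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model.train()

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)  # 5e-5..1e-4 reported best for T5-770M
total_steps = 10_000  # placeholder; Table 6 of the paper lists the actual per-dataset value
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=2_000, num_training_steps=total_steps)

# One illustrative optimization step; in practice, iterate over a dataloader with a total
# batch size of 16 (one A100 for T5-770M, or 8 A100s x per-GPU batch 2 for T5-3B).
inputs = tokenizer(["question: who wrote hamlet?"], return_tensors="pt")
labels = tokenizer(["william shakespeare"], return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask, labels=labels).loss
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```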