Generate rather than Retrieve: Large Language Models are Strong Context Generators
Authors: Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, Meng Jiang
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on three different knowledge-intensive tasks, including open-domain QA, fact checking, and dialogue system. |
| Researcher Affiliation | Collaboration | 1University of Notre Dame 2Microsoft Cognitive Service Research 3University of Southern California |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and generated documents can be found at https://github.com/wyu97/GenRead. |
| Open Datasets | Yes | open-domain QA (NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and WebQ (Berant et al., 2013)), fact checking (FEVER (Thorne et al., 2018) and FM2 (Eisenschlos et al., 2021)) and open-domain dialogue system (WoW (Dinan et al., 2019)). |
| Dataset Splits | Yes | Table 5 (Dataset Splits; Train / Valid / Test / Test labels): TriviaQA (Joshi et al., 2017), open domain: 78,785 / 8,837 / 11,313 / public; TriviaQA, wikipedia split: test 7,993 / public; WebQ (Berant et al., 2013), open domain: 3,478 / 300 / 2,032 / public; NQ (Kwiatkowski et al., 2019), open domain: 79,168 / 8,757 / 3,610 / public; FEVER (Thorne et al., 2018), kilt challenge: 104,966 / 10,444 / 10,100 / hidden; FM2 (Eisenschlos et al., 2021), official split: 10,149 / 1,169 / 1,380 / public; WoW (Dinan et al., 2019), kilt challenge: 63,734 / 3,054 / 2,944 / hidden |
| Hardware Specification | Yes | As reported by Izacard & Grave (2021), the training process requires 64 Tesla V100 32GB running for around one day. We use one A100 for running T5-770M and set the batch size of 16. We use 8 A100 for running T5-3B and set the per GPU batch as 2, leading to the total batch size as 16. |
| Software Dependencies | No | The paper mentions models like T5 and FiD, and links to repositories for baseline implementations (BM25, DPR, Contriever), but does not specify software versions (e.g., Python, PyTorch, or specific library versions) for reproducibility. |
| Experiment Setup | Yes | We use AdamW as the optimizer, with 2,000 warm-up steps. We set the dropout probability to 0.1 and weight decay to 0.01. We use one A100 for running T5-770M and set the batch size of 16. We use 8 A100 for running T5-3B and set the per GPU batch as 2, leading to the total batch size as 16. We searched different learning rates, ranging from 5e-6 to 4e-5, and we found 3e-5 to 6e-5 performed the best under the T5-3B setting and 5e-5 to 1e-4 performed the best under the T5-770M setting. (Table 6 additionally lists 'Peak learning rate', 'Total batch size', and 'Total training steps'.) |
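
As a rough illustration of the fine-tuning recipe quoted in the Hardware Specification and Experiment Setup rows (AdamW, 2,000 warm-up steps, dropout 0.1, weight decay 0.01, and an effective batch size of 16), the sketch below sets up an equivalent configuration with PyTorch and Hugging Face Transformers. The checkpoint name, the linear schedule, and the total step count are illustrative assumptions, not details taken from the authors' released code.

```python
# Minimal sketch of the reported fine-tuning hyperparameters.
# Assumptions (not from the paper's code): the t5-large checkpoint, a linear
# warm-up/decay schedule, and the placeholder total step count.
import torch
from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup

# T5-770M corresponds to t5-large; dropout probability 0.1 as reported.
model = T5ForConditionalGeneration.from_pretrained("t5-large", dropout_rate=0.1)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-5,           # within the 5e-5 to 1e-4 range reported for T5-770M
    weight_decay=0.01,  # weight decay reported in the paper
)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,     # warm-up steps reported in the paper
    num_training_steps=20_000,  # placeholder; Table 6 lists the actual totals
)

# Effective batch size of 16: one A100 with batch size 16 for T5-770M,
# or 8 A100s with a per-GPU batch of 2 for T5-3B (8 x 2 = 16).
def training_step(batch):
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```

In the paper's actual setup the reader is FiD (Izacard & Grave, 2021), which fuses multiple generated documents per question; the snippet above only mirrors the reported optimizer, schedule, and batch-size settings.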