Retrieval is Accurate Generation

Authors: Bowen Cao, Deng Cai, Leyang Cui, Xuxin Cheng, Wei Bi, Yuexian Zou, Shuming Shi

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments show that our model not only outperforms standard language models on a variety of knowledge-intensive tasks but also demonstrates improved generation quality in open-ended text generation. We verify the effectiveness of our methods on a set of knowledge-intensive tasks and open-ended text generation tasks without fine-tuning."
Researcher Affiliation | Collaboration | Bowen Cao, Xuxin Cheng, Yuexian Zou (School of ECE, Peking University); Deng Cai, Leyang Cui, Wei Bi, Shuming Shi (Tencent AI Lab). Emails: {cbw2021,chengxx}@stu.pku.edu.cn, zouyx@pku.edu.cn, thisisjcykcd@gmail.com, {leyangcui,victoriabi,shumingshi}@tencent.com
Pseudocode | No | The paper describes processes such as the bootstrapping algorithm in prose but does not provide structured pseudocode or algorithm blocks.
Open Source Code | No | The paper contains no explicit statement about releasing the source code for the described methodology and no direct link to a repository. It only cites and links to external, pre-existing models such as GPT-2 and DensePhrases.
Open Datasets | Yes | "We train our model on the training set of MiniPile (Kaddour, 2023), and use the English Wikipedia dump of March 1, 2022 as supporting documents." Dataset links: https://huggingface.co/datasets/JeanKaddour/minipile, https://huggingface.co/datasets/wikipedia, and https://huggingface.co/datasets/gamino/wiki_medical_terms
Dataset Splits | Yes | "We train our model on the training set of MiniPile (Kaddour, 2023)... We conduct open-ended text generation experiments on the test set of MiniPile (Kaddour, 2023)... MedMCQA (Pal et al., 2022) is a comprehensive, high-quality dataset designed for biomedical question-answering. We use its validation split, which consists of 4,183 questions."
Hardware Specification | Yes | "The entire preprocessing process, including syntactic parsing, phrase selection, and semantic matching, takes approximately 24 hours on 8 V100 GPUs."
Software Dependencies | No | The paper mentions tools such as the Stanford Parser (via a link to Stanza) and FAISS, but it does not specify concrete version numbers for these or any other software dependencies required for reproducibility.
Experiment Setup | Yes | "While revising the training oracles via self-reinforcement, we retrieve the top k = 128 phrases for each prefix. In all experiments, we set k to 128 (see the analysis on k in Table 7 in Appendix G) and p to 0.95. To control the ratio of phrase retrieval, we filter out phrases with probabilities below a threshold. The threshold is set to ϕ = 0.4 if not otherwise specified."
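The retrieval-and-filtering setup quoted above (retrieve the top k = 128 candidate phrases per prefix, then discard phrases whose probability falls below the threshold ϕ = 0.4) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the candidate pool, phrase scores, and helper names here are hypothetical stand-ins for a real phrase index and scoring model.

```python
import heapq

K = 128    # number of candidate phrases retrieved per prefix (paper's k)
PHI = 0.4  # probability threshold below which retrieved phrases are dropped

def top_k_phrases(scored_phrases, k=K):
    """Keep the k highest-probability (phrase, probability) candidates."""
    return heapq.nlargest(k, scored_phrases, key=lambda item: item[1])

def filter_by_threshold(candidates, phi=PHI):
    """Drop phrases with probability below phi; this controls how often
    the model retrieves a phrase rather than generating token by token."""
    return [(phrase, p) for phrase, p in candidates if p >= phi]

# Toy candidate pool standing in for a real phrase-index lookup.
candidates = [("the capital of France", 0.92), ("a city in Europe", 0.35),
              ("Paris", 0.81), ("located in", 0.12)]
kept = filter_by_threshold(top_k_phrases(candidates))
print(kept)  # only the high-probability phrases survive the filter
```

Raising ϕ makes the model fall back to token-level generation more often; lowering it increases the share of retrieved phrases, matching the paper's description of ϕ as a knob for the retrieval ratio.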