Self-Retrieval: End-to-End Information Retrieval with One Large Language Model
Authors: Qiaoyu Tang, Jiawei Chen, Zhuoqun Li, Bowen Yu, Yaojie Lu, Cheng Fu, Haiyang Yu, Hongyu Lin, Fei Huang, Ben He, Xianpei Han, Le Sun, Yongbin Li
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that Self-Retrieval not only outperforms existing retrieval approaches by a significant margin, but also substantially enhances the performance of LLM-driven downstream applications like retrieval-augmented generation. We evaluate Self-Retrieval on three representative retrieval benchmarks: NQ, TriviaQA, and MS MARCO. Experimental results demonstrate that Self-Retrieval substantially outperforms existing sparse retrieval, dense retrieval, and generative retrieval methods on both document-level and passage-level retrieval tasks. |
| Researcher Affiliation | Collaboration | 1Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences 2University of Chinese Academy of Sciences 3Alibaba Group |
| Pseudocode | No | The paper describes the system components and processes in text and with a diagram (Figure 1), but no formal pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | The code of this work is available at https://github.com/icip-cas/SelfRetrieval. |
| Open Datasets | Yes | We conduct main experiments on Natural Questions (NQ) [21] and TriviaQA [18] datasets, both of which are widely used retrieval benchmarks based on Wikipedia. We use their versions from the KILT benchmark [34], which consolidates these datasets into a single pre-processed Wikipedia dump, facilitating easier evaluation. [...] Additionally, to evaluate the model's robustness in non-Wikipedia scenarios where high-quality text and titles are not available, we conduct experiments on a subset of MS MARCO [3]... |
| Dataset Splits | Yes | Since the KILT test set is not publicly accessible, we use the development set for testing and randomly sample 2,000 instances from the training set as our development set. For our experiments, we sample approximately 40K documents from Wikipedia for each dataset. Each document is segmented into passages of at most 200 words, yielding approximately 1 million passages in total. The detailed statistics of the datasets are presented in Appendix A. We use passage-level Hits@{1, 5} and Mean Reciprocal Rank (MRR)@5 as evaluation metrics. Table 6: Statistics of the experimental datasets. #doc/#psg denotes number of documents/passages; #train/#dev/#test denotes size of training/development/test set. (A minimal sketch of these passage-level metrics follows the table.) |
| Hardware Specification | Yes | We train the models using ZeRO stage-2 optimization on 8 NVIDIA A100 (80 GB) GPUs with the AdamW optimizer, a batch size of 16 per GPU, and BFloat16 precision. |
| Software Dependencies | No | The paper mentions specific optimization techniques (ZeRO stage-2 optimization, AdamW optimizer) and model names (StableLM, Llama2, Qwen-1.5) but does not provide version numbers for core software dependencies like PyTorch, Python, or CUDA. |
| Experiment Setup | Yes | We train the models using ZeRO stage-2 optimization on 8 NVIDIA A100 (80 GB) GPUs with the AdamW optimizer, a batch size of 16 per GPU, and BFloat16 precision. The models are trained for 3 epochs with a learning rate of 2e-5. During inference, we use beam search to generate 5 titles and 10 passages for each title, with hyperparameters τ and δ set to 0.4 across all models and datasets. (A hedged configuration sketch follows the table.) |
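
The passage-level metrics quoted in the Dataset Splits row (Hits@{1, 5} and MRR@5) are standard and straightforward to reproduce. The Python sketch below is not taken from the authors' code; the 200-word segmentation helper, the passage identifiers, and the toy example are illustrative assumptions.

```python
from typing import List, Set

def segment_document(text: str, max_words: int = 200) -> List[str]:
    """Naive stand-in for the paper's segmentation step: split a document
    into consecutive passages of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def hits_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """1.0 if any gold passage id appears in the top-k results, else 0.0."""
    return float(any(pid in gold for pid in ranked[:k]))

def mrr_at_k(ranked: List[str], gold: Set[str], k: int = 5) -> float:
    """Reciprocal rank of the first gold passage within the top-k, else 0.0."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in gold:
            return 1.0 / rank
    return 0.0

# Toy example with made-up passage ids: one query hits at rank 3, one misses.
runs = [
    (["p3", "p7", "p1", "p9", "p2"], {"p1"}),
    (["p5", "p8", "p4", "p6", "p0"], {"p2"}),
]
print("Hits@1:", sum(hits_at_k(r, g, 1) for r, g in runs) / len(runs))  # 0.0
print("Hits@5:", sum(hits_at_k(r, g, 5) for r, g in runs) / len(runs))  # 0.5
print("MRR@5: ", sum(mrr_at_k(r, g, 5) for r, g in runs) / len(runs))   # ~0.167
```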
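
The training and inference settings in the Experiment Setup row map directly onto a standard fine-tuning configuration. The sketch below assumes a Hugging Face Trainer driven by a DeepSpeed ZeRO stage-2 config file; the output path, the DeepSpeed JSON filename, and the `GEN_CONFIG` dictionary are hypothetical placeholders, not the authors' actual code or API.

```python
# Minimal sketch of the reported training hyperparameters, assuming a
# Hugging Face Trainer + DeepSpeed setup (8x A100 handled by the launcher).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="self_retrieval_ckpt",    # hypothetical output path
    num_train_epochs=3,                  # "trained for 3 epochs"
    learning_rate=2e-5,                  # reported learning rate
    per_device_train_batch_size=16,      # batch size of 16 per GPU
    bf16=True,                           # BFloat16 precision
    optim="adamw_torch",                 # AdamW optimizer
    deepspeed="ds_zero2.json",           # ZeRO stage-2 config (hypothetical file)
)

# Reported inference-time settings: beam search producing 5 titles and
# 10 passages per title, with thresholds tau = delta = 0.4.
GEN_CONFIG = {
    "num_title_beams": 5,
    "num_passages_per_title": 10,
    "tau": 0.4,
    "delta": 0.4,
}
```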