IncDSI: Incrementally Updatable Document Retrieval

Authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach by incrementally adding up to 10k documents to a trained retrieval model, evaluating both retrieval performance and the speed of adding documents.
Researcher Affiliation | Academia | School of Computer Science, Cornell University, Ithaca, USA. Correspondence to: Varsha Kishore <vk352@cornell.edu>, Justin Lovelace <jl3353@cornell.edu>.
Pseudocode | Yes | Algorithm 1 IncDSI
Open Source Code | Yes | Our code for IncDSI is available at https://github.com/varshakishore/IncDSI.
Open Datasets | Yes | We conduct our experiments on two publicly available datasets: Natural Questions 320K (Kwiatkowski et al., 2019) and MS MARCO Document Ranking (Nguyen et al., 2016).
Dataset Splits | Yes | We randomly sample 90% of the documents to form the initial set D0, 9% of the documents to form the new set D1, and 1% of the documents to form the tuning set Dtune. Each dataset also has natural human queries that are associated with the documents. We use the official NQ and MSMARCO train-validation splits to divide the queries into train/val/test splits as follows: the train split is divided into 80% train / 20% validation data, and the validation split is used as test data. (A split sketch appears after the table.)
Hardware Specification | Yes | For all our experiments, we use one A6000 GPU.
Software Dependencies | No | The paper mentions 'PyTorch' and the 'Ax library' but does not specify their version numbers, which are required for a reproducible description of software dependencies.
Experiment Setup | Yes | For the continual training baselines, the document retrieval model is trained for 20 epochs on the initial set of documents and for an additional 10 epochs on both the initial and new documents. Learning rates of 1e-5 and 5e-5 and batch sizes of 128 and 1024 are used for NQ320K and MSMARCO, respectively. (These settings are collected in the config sketch after the table.)
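
The 90/9/1 document split and 80/20 query split described in the Dataset Splits row can be reproduced in a few lines. The sketch below is a minimal illustration, assuming documents and queries are available as plain Python lists; the helper names (split_documents, split_train_queries) and the seed are hypothetical and are not taken from the IncDSI codebase.

```python
import random

def split_documents(doc_ids, seed=0):
    """Randomly split document IDs 90/9/1 into the initial set D0,
    the new set D1, and the tuning set Dtune (hypothetical helper,
    mirroring the split described in the paper)."""
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_initial = int(0.90 * n)
    n_new = int(0.09 * n)
    d0 = ids[:n_initial]                    # 90%: initial documents
    d1 = ids[n_initial:n_initial + n_new]   # 9%:  documents added incrementally
    d_tune = ids[n_initial + n_new:]        # ~1%: hyperparameter-tuning documents
    return d0, d1, d_tune

def split_train_queries(train_queries, seed=0):
    """Divide the official train split 80/20 into train/validation;
    the official validation split is held out as the test set."""
    rng = random.Random(seed)
    queries = list(train_queries)
    rng.shuffle(queries)
    cut = int(0.80 * len(queries))
    return queries[:cut], queries[cut:]
```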
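
The Experiment Setup row pins down the continual-training baseline schedule and the per-dataset hyperparameters. The dictionary below collects the values reported in the paper; the layout and key names are my own illustration, not structures from the released code.

```python
# Hyperparameters for the continual training baselines, as reported in the
# paper; the dictionary itself is an illustrative config, not from the repo.
CONTINUAL_TRAINING_CONFIG = {
    "NQ320K": {
        "learning_rate": 1e-5,
        "batch_size": 128,
        "initial_epochs": 20,    # training on the initial document set D0
        "continual_epochs": 10,  # further training on initial + new documents
    },
    "MSMARCO": {
        "learning_rate": 5e-5,
        "batch_size": 1024,
        "initial_epochs": 20,
        "continual_epochs": 10,
    },
}
```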