IncDSI: Incrementally Updatable Document Retrieval
Authors: Varsha Kishore, Chao Wan, Justin Lovelace, Yoav Artzi, Kilian Q. Weinberger
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach by incrementally adding up to 10k documents to a trained retrieval model, evaluating both retrieval performance and the speed of adding documents. |
| Researcher Affiliation | Academia | School of Computer Science, Cornell University, Ithaca, USA. Correspondence to: Varsha Kishore <vk352@cornell.edu>, Justin Lovelace <jl3353@cornell.edu>. |
| Pseudocode | Yes | Algorithm 1: IncDSI |
| Open Source Code | Yes | Our code for IncDSI is available at https://github.com/varshakishore/IncDSI. |
| Open Datasets | Yes | We conduct our experiments on two publicly available datasets: Natural Questions 320K (Kwiatkowski et al., 2019) and MS MARCO Document Ranking (Nguyen et al., 2016). |
| Dataset Splits | Yes | We randomly sample 90% of the documents to form the initial set D0, 9% to form the new set D1, and 1% to form the tuning set Dtune. Each dataset also has natural human queries associated with the documents. We use the official NQ and MSMARCO train-validation splits to divide the queries into train/val/test splits as follows: the train split is divided into 80% train / 20% validation data, and the validation split is used as test data. (See the split sketch after the table.) |
| Hardware Specification | Yes | For all our experiments, we use one A6000 GPU. |
| Software Dependencies | No | The paper mentions 'PyTorch' and the 'Ax library' but does not specify their version numbers, which are required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | For the continual training baselines, the document retrieval model is trained for 20 epochs on the initial set of documents and for an additional 10 epochs on both the initial and new documents. Learning rates of 1e-5 and 5e-5 and batch sizes of 128 and 1024 are used for NQ320K and MSMARCO, respectively. (See the config sketch after the table.) |
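
To make the Dataset Splits row concrete, here is a minimal Python sketch of the described 90/9/1 document split and 80/20 train-query split. The function names, the seed handling, and the D1/Dtune labels are assumptions for illustration; only the proportions come from the paper.

```python
import random

def split_documents(doc_ids, seed=0):
    """Split document ids 90/9/1 into initial (D0), new (D1), and tuning
    (Dtune) sets, mirroring the proportions reported in the paper.
    Names and seed handling are illustrative, not from the IncDSI repo."""
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n_initial = int(0.90 * len(ids))
    n_new = int(0.09 * len(ids))
    d0 = ids[:n_initial]
    d1 = ids[n_initial:n_initial + n_new]
    d_tune = ids[n_initial + n_new:]  # remaining ~1%
    return d0, d1, d_tune

def split_train_queries(queries, seed=0):
    """Divide the official train split into 80% train / 20% validation;
    the official validation split is held out as the test set."""
    rng = random.Random(seed)
    qs = list(queries)
    rng.shuffle(qs)
    cut = int(0.80 * len(qs))
    return qs[:cut], qs[cut:]
```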
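
The Experiment Setup row similarly reduces to a small set of hyperparameters. A minimal sketch, assuming a plain dictionary layout; the key names and the `schedule` helper are illustrative, and only the numeric values (epochs, learning rates, batch sizes) are taken from the paper.

```python
# Continual-training baseline settings from the Experiment Setup row.
# The layout and key names are illustrative assumptions; only the
# numeric values are reported in the paper.
BASELINE_CONFIG = {
    "NQ320K":  {"learning_rate": 1e-5, "batch_size": 128},
    "MSMARCO": {"learning_rate": 5e-5, "batch_size": 1024},
}
INITIAL_EPOCHS = 20    # epochs on the initial document set D0
CONTINUAL_EPOCHS = 10  # additional epochs on initial + new documents

def schedule(dataset):
    """Return the full baseline training schedule for one dataset."""
    cfg = dict(BASELINE_CONFIG[dataset])
    cfg.update(initial_epochs=INITIAL_EPOCHS,
               continual_epochs=CONTINUAL_EPOCHS)
    return cfg

print(schedule("NQ320K"))
# {'learning_rate': 1e-05, 'batch_size': 128,
#  'initial_epochs': 20, 'continual_epochs': 10}
```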