Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

Authors: Ramnath Kumar, Anshul Mittal, Nilesh Gupta, Aditya Kusupati, Inderjit S Dhillon, Prateek Jain

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on standard benchmarks, including MS MARCO (Dev set) and TREC DL19, demonstrate EHI's superiority over traditional ANNS indices. Under the same computational constraints, EHI outperforms existing state-of-the-art methods by +1.45% in MRR@10 on MS MARCO (Dev) and +8.2% in nDCG@10 on TREC DL19, highlighting the benefits of our end-to-end approach.
Researcher Affiliation Collaboration Ramnath Kumar (Google Inc.); Anshul Mittal (Microsoft, Google Inc.); Nilesh Gupta (UT Austin, Google Inc.); Aditya Kusupati (Google Inc.); Inderjit Dhillon (Google Inc., UT Austin); Prateek Jain (Google Inc.)
Pseudocode Yes Algorithm 1 Training step for EHI. u(q) & v(d) denote the query (q) & document (d) representations from the encoder Eθ. Similarly, the path embeddings produced by the indexer (Iφ) are denoted by T(q) & T(d). Please refer to Algorithm 2 in the appendix for the definitions of TOPK-INDEXER and INDEXER. Note that the Update(.) function updates the encoder and indexer parameters through back-propagation of the loss, using AdamW (Loshchilov & Hutter, 2017) in our case, while other optimizers (e.g., SGD, Adam) could also be used.
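The training step quoted above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: `encoder` and `indexer` stand in for Eθ and Iφ, and the two cosine-based loss terms and their weights `alpha`/`beta` are assumptions chosen to show the end-to-end update, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, indexer, optimizer, query, doc, alpha=1.0, beta=1.0):
    """One hypothetical EHI-style training step (illustrative only)."""
    u_q = encoder(query)   # query embedding u(q)
    v_d = encoder(doc)     # document embedding v(d)
    t_q = indexer(u_q)     # path embedding T(q)
    t_d = indexer(v_d)     # path embedding T(d)

    # Pull matched query/document embeddings together (retrieval term).
    sim_loss = 1.0 - F.cosine_similarity(u_q, v_d, dim=-1).mean()
    # Encourage matched pairs to route to the same tree path (indexing term).
    path_loss = 1.0 - F.cosine_similarity(t_q, t_d, dim=-1).mean()

    loss = alpha * sim_loss + beta * path_loss
    optimizer.zero_grad()
    loss.backward()        # Update(.): back-propagate through encoder and indexer
    optimizer.step()       # AdamW (or SGD/Adam) parameter update
    return loss.item()
```

Because the loss flows through both modules, a single `optimizer.step()` jointly updates the encoder and the indexer, which is the "end-to-end" property the row describes.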
Open Source Code No The paper does not provide a direct link to a source code repository or an explicit statement about releasing the code for the methodology described. It only mentions using a pre-trained Sentence-BERT DistilBERT model from Hugging Face.
Open Datasets Yes We evaluate EHI on four standard but diverse retrieval datasets of increasing size: SciFact (Wadden et al., 2020), FIQA (Maia et al., 2018), NQ320k (Kwiatkowski et al., 2019), and MS MARCO (Bajaj et al., 2016). Appendix A provides additional details about these datasets.
Dataset Splits Yes SciFact (Wadden et al., 2020) is a fact-checking benchmark that verifies scientific claims using evidence from research literature containing scientific paper abstracts. The dataset has 5,000 documents and a standard train-test split. We use the original publicly available dev split from the task, comprising 300 test queries, and include all documents from the original dataset as our corpus. [...] FiQA (Maia et al., 2018) is an open-domain question-answering task over financial data, built by crawling Stack Exchange posts under the Investment topic from 2009-2017 as our corpus. It consists of 57,368 documents and a publicly available test split from Thakur et al. (2021); as the test set, we use the random sample of 500 queries provided by Thakur et al. (2021). [...] The MS MARCO benchmark (Bajaj et al., 2016) has been included since it is widely recognized as the gold standard for evaluating and benchmarking large-scale information retrieval systems (Thakur et al., 2021; Ni et al., 2021). It is a collection of real-world search queries and corresponding documents carefully curated from the Microsoft Bing search engine. What sets MS MARCO apart from other datasets is its scale and diversity: approximately 9 million documents in its corpus and 532,761 query-passage pairs for fine-tuning the majority of the retrievers. [...] The NQ320k benchmark (Kwiatkowski et al., 2019) has become a standard information-retrieval benchmark used to showcase the efficacy of various SOTA approaches such as DSI (Tay et al., 2022) and NCI (Wang et al., 2022). In this work, we use the same NQ320k preprocessing steps as NCI.
Hardware Specification No The paper mentions that EHI takes advantage of GPUs and that different systems are implemented using different environments, but it does not specify any particular GPU models, CPU models, or detailed hardware configurations used for their experiments.
Software Dependencies No The paper mentions using a pre-trained Sentence-BERT distilbert model (Reimers & Gurevych, 2019) and the AdamW optimizer (Loshchilov & Hutter, 2017), but it does not provide specific version numbers for these software components or a comprehensive list of all software dependencies with versions.
Experiment Setup Yes Detailed training hyperparameters for EHI are provided in Appendix B. [...] Table 8: Hyperparameters used for training EHI on various datasets. Note that the number of epochs and the refresh rate (r) were set to 100 and 5, respectively. EHI is initialized with DistilBERT, with an encoder embedding dimension of 768. The table lists batch size, number of leaves, encoder learning rate, classifier learning rate, and loss factors 1-3.
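For concreteness, the Table 8 fields quoted above could be collected into a single config mapping. Only the values stated in the quote (100 epochs, refresh rate 5, embedding dimension 768) are from the paper; every value marked PLACEHOLDER below is invented for this sketch and does not come from Table 8.

```python
# Hedged sketch of an EHI training config. Field names mirror the quoted
# Table 8 description; PLACEHOLDER values are illustrative, not reported.
ehi_config = {
    "epochs": 100,                    # stated in the paper
    "refresh_rate": 5,                # stated in the paper (r)
    "encoder_embedding_dim": 768,     # stated (DistilBERT encoder)
    "batch_size": 1024,               # PLACEHOLDER
    "num_leaves": 5000,               # PLACEHOLDER
    "encoder_lr": 3e-5,               # PLACEHOLDER
    "classifier_lr": 1e-3,            # PLACEHOLDER
    "loss_factors": (1.0, 1.0, 1.0),  # PLACEHOLDER (factors 1-3)
}
```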