Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval

Authors: Ramnath Kumar, Anshul Mittal, Nilesh Gupta, Aditya Kusupati, Inderjit S Dhillon, Prateek Jain

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive evaluations on standard benchmarks, including MS MARCO (Dev set) and TREC DL19, demonstrate EHI's superiority over traditional ANNS indices. Under the same computational constraints, EHI outperforms existing state-of-the-art methods by +1.45% in MRR@10 on MS MARCO (Dev) and +8.2% in nDCG@10 on TREC DL19, highlighting the benefits of our end-to-end approach.
Researcher Affiliation Collaboration Ramnath Kumar (Google Inc.); Anshul Mittal (Microsoft, Google Inc.); Nilesh Gupta (UT Austin, Google Inc.); Aditya Kusupati (Google Inc.); Inderjit Dhillon (Google Inc., UT Austin); Prateek Jain (Google Inc.)
Pseudocode Yes Algorithm 1 Training step for EHI. u(q) & v(d) denote the query (q) & document (d) representations from the encoder Eθ. Similarly, the path embeddings produced by the indexer (Iφ) are denoted by T(q) & T(d). Please refer to Algorithm 2 in the appendix for the definitions of TOPK-INDEXER and INDEXER. Note that the Update(.) function updates the encoder and indexer parameters through back-propagation of the loss, using AdamW (Loshchilov & Hutter, 2017) in our case, while other optimizers (e.g., SGD, Adam) could also be used.
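The training step quoted above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: `encoder` and `indexer` stand in for Eθ and Iφ, and the two cosine-based loss terms and their weights `alpha`/`beta` are assumptions chosen to show the end-to-end update, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, indexer, optimizer, query, doc, alpha=1.0, beta=1.0):
    """One hypothetical EHI-style training step (illustrative only)."""
    u_q = encoder(query)   # query embedding u(q)
    v_d = encoder(doc)     # document embedding v(d)
    t_q = indexer(u_q)     # path embedding T(q)
    t_d = indexer(v_d)     # path embedding T(d)

    # Pull matched query/document embeddings together (retrieval term).
    sim_loss = 1.0 - F.cosine_similarity(u_q, v_d, dim=-1).mean()
    # Encourage matched pairs to route to the same tree path (indexing term).
    path_loss = 1.0 - F.cosine_similarity(t_q, t_d, dim=-1).mean()

    loss = alpha * sim_loss + beta * path_loss
    optimizer.zero_grad()
    loss.backward()        # Update(.): back-propagate through encoder and indexer
    optimizer.step()       # AdamW (or SGD/Adam) parameter update
    return loss.item()
```

Because the loss flows through both modules, a single `optimizer.step()` jointly updates the encoder and the indexer, which is the "end-to-end" property the row describes.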
Open Source Code No The paper does not provide a direct link to a source code repository or an explicit statement about releasing the code for the methodology described. It only mentions using a pre-trained Sentence-BERT DistilBERT model from Hugging Face.
Open Datasets Yes We evaluate EHI on four standard but diverse retrieval datasets of increasing size: SciFact (Wadden et al., 2020), FIQA (Maia et al., 2018), NQ320k (Kwiatkowski et al., 2019), and MS MARCO (Bajaj et al., 2016). Appendix A provides additional details about these datasets.
Dataset Splits Yes SciFact (Wadden et al., 2020) is a fact-checking benchmark that verifies scientific claims using evidence from research literature containing scientific paper abstracts. The dataset has 5,000 documents and a standard train-test split. We use the original publicly available dev split from the task, comprising 300 test queries, and include all documents from the original dataset as our corpus. [...] FiQA (Maia et al., 2018) is an open-domain question-answering task over financial data, built by crawling Stack Exchange posts under the Investment topic from 2009-2017 as our corpus. It consists of 57,368 documents and a publicly available test split from Thakur et al. (2021); as the test set, we use the random sample of 500 queries provided by Thakur et al. (2021). [...] The MS MARCO benchmark (Bajaj et al., 2016) has been included since it is widely recognized as the gold standard for evaluating and benchmarking large-scale information retrieval systems (Thakur et al., 2021; Ni et al., 2021). It is a collection of real-world search queries and corresponding documents carefully curated from the Microsoft Bing search engine. What sets MS MARCO apart from other datasets is its scale and diversity: approximately 9 million documents in its corpus and 532,761 query-passage pairs for fine-tuning the majority of the retrievers. [...] The NQ320k benchmark (Kwiatkowski et al., 2019) has become a standard information-retrieval benchmark used to showcase the efficacy of various SOTA approaches such as DSI (Tay et al., 2022) and NCI (Wang et al., 2022). In this work, we use the same NQ320k preprocessing steps as NCI.
Hardware Specification No The paper mentions that EHI takes advantage of GPUs and that different systems are implemented using different environments, but it does not specify any particular GPU models, CPU models, or detailed hardware configurations used for their experiments.
Software Dependencies No The paper mentions using a pre-trained Sentence-BERT distilbert model (Reimers & Gurevych, 2019) and the AdamW optimizer (Loshchilov & Hutter, 2017), but it does not provide specific version numbers for these software components or a comprehensive list of all software dependencies with versions.
Experiment Setup Yes Detailed training hyperparameters for EHI are provided in Appendix B. [...] Table 8: Hyperparameters used for training EHI on various datasets. Note that the number of epochs and the refresh rate (r) were set to 100 and 5, respectively. EHI is initialized with DistilBERT, with an encoder embedding dimension of 768. The table lists batch size, number of leaves, encoder learning rate, classifier learning rate, and loss factors 1-3.
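For concreteness, the Table 8 fields quoted above could be collected into a single config mapping. Only the values stated in the quote (100 epochs, refresh rate 5, embedding dimension 768) are from the paper; every value marked PLACEHOLDER below is invented for this sketch and does not come from Table 8.

```python
# Hedged sketch of an EHI training config. Field names mirror the quoted
# Table 8 description; PLACEHOLDER values are illustrative, not reported.
ehi_config = {
    "epochs": 100,                    # stated in the paper
    "refresh_rate": 5,                # stated in the paper (r)
    "encoder_embedding_dim": 768,     # stated (DistilBERT encoder)
    "batch_size": 1024,               # PLACEHOLDER
    "num_leaves": 5000,               # PLACEHOLDER
    "encoder_lr": 3e-5,               # PLACEHOLDER
    "classifier_lr": 1e-3,            # PLACEHOLDER
    "loss_factors": (1.0, 1.0, 1.0),  # PLACEHOLDER (factors 1-3)
}
```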