Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe

Authors: Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments reveal a lost-in-the-long-distance phenomenon, where retrieval accuracy degrades for documents further away in the hierarchy. To address this, we introduce a pretrain-finetune recipe that significantly improves long-distance retrieval without sacrificing performance on closer documents. We experiment on a realistic hierarchy from WordNet for retrieving documents at various levels of abstraction, and show that pretrain-finetune boosts the recall on long-distance pairs from 19% to 76%. Finally, we demonstrate that our method improves retrieval of relevant products on a shopping queries dataset.
Researcher Affiliation	Industry	Chong You, Rajesh Jayaram, Ananda Theertha Suresh, Robin Nittka, Felix Yu, Sanjiv Kumar. Google EMAIL
Pseudocode	Yes	Algorithm 1 A constructive algorithm for Hierarchical Retrieval
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We do not provide open access to the code.
Open Datasets	Yes	6 Experiments on Real Data: In this section, we experiment with the pretrain-finetune recipe on two real datasets, namely WordNet and ESCI. 6.1 WordNet Experiments: WordNet [22] is a large lexical database of English where the nouns, verbs, adjectives, and adverbs are grouped into synsets that represent synonyms. 6.2 Experiment on ESCI Shopping Dataset: ESCI [28] is a public Amazon search dataset, containing 2.6 million manually labeled query-product relevance judgements in four categories, namely, Exact, Substitute, Complement, and Irrelevant.
Dataset Splits	Yes	For the tree with H = 4, W = 5, any query that corresponds to a leaf node has 3 matching documents with distance 0, 1, and 2, respectively (recall that the root node does not correspond to any query / document). We evaluate recall for query-document pairs at these three distances separately, and report results in Figure 3a. Unless specified otherwise, we use the following regular sampling procedure to generate training and evaluation data. First, a query q is sampled uniformly at random among all 82,115 synsets. Then, a document is sampled uniformly at random from the set of all matching documents to q. ... We use a validation set of size 10k to pick the best checkpoint. Then, we report in Table 1 the recall computed on a test set of size 10k, including an overall recall that is averaged over all pairs in the test set, and recall on slices with different query-document distances (i.e., 0, 1, ..., 8). ESCI comes with a train vs test data splitting. We take the Exact and Substitute pairs from the train split as our training sets, denoted as E_train and S_train, respectively. E_train and S_train contain 1.3 million and 0.4 million matches, respectively. We sample 5k Exact and 2k Substitute pairs from the test split for evaluating our model. These two sets are denoted as E_test and S_test, respectively.
Hardware Specification	No	The paper does not explicitly describe the specific hardware (e.g., GPU model, CPU model, TPU version) used for running its experiments. It mentions using 'Transformers for the encoder models' but not the underlying hardware.
Software Dependencies	No	We use the SentencePiece tokenizer, Transformers for the encoder models in DE, and the LazyAdam optimizer [3]. Details are provided in Appendix B.
Experiment Setup	Yes	We train a lookup-table DE by optimizing Equation (1) using SGD for 50k iterations on 10M matching pairs from regular sampling. We use learning rate 0.5, momentum 0.9, and batch size 4096. During finetuning, we reduce the learning rate to 1,000 times smaller and increase the temperature in Equation (1) from 20 to 500; an ablation study on these two hyper-parameters is provided in Appendix E. For the Transformer, we use model dimension 512, 8 attention heads, two-layer MLP with GELU activation and a hidden dimension of 4096 as the feedforward network. The output embeddings from the Transformer are mean-pooled and projected to 128 dimensions, followed by a normalization to the unit ℓ2 sphere as the final embedding. The model is trained with the LazyAdam optimizer [3] with a warmup stage of 2000 steps to a learning rate of 1e-4, followed by a linear decay to 1e-6 at step 50000.
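
Illustration (not from the paper): the Transformer learning-rate schedule quoted in the Experiment Setup row can be sketched as a piecewise-linear function of the training step. The function name `learning_rate` and its parameter names are hypothetical. The quoted text gives only the warmup length (2000 steps), the peak rate (1e-4), and the end point (1e-6 at step 50000); the sketch assumes the warmup is linear and starts from 0, which the source does not state.

```python
def learning_rate(step, warmup_steps=2000, peak_lr=1e-4,
                  final_lr=1e-6, total_steps=50000):
    """Piecewise-linear schedule: warmup to peak_lr, then linear decay to final_lr.

    Hypothetical helper; only the four default values come from the quoted setup.
    """
    if step < warmup_steps:
        # Assumption: linear warmup starting at 0 (shape and start are not
        # given in the quoted text).
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        # Hold the final rate if training runs past the stated end step.
        return final_lr
    # Fraction of the decay phase completed, in [0, 1).
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + frac * (final_lr - peak_lr)


if __name__ == "__main__":
    for s in (0, 1000, 2000, 26000, 50000):
        print(s, learning_rate(s))
```

Under these assumptions the rate peaks at step 2000 and reaches 1e-6 exactly at step 50000. A different warmup shape would change only the first branch.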