Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

RouterRetriever: Routing over a Mixture of Expert Embedding Models

Authors: Hyunji Lee, Luca Soldaini, Arman Cohan, Minjoon Seo, Kyle Lo

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation on the BEIR benchmark demonstrates that ROUTERRETRIEVER outperforms both models trained on MSMARCO (+2.1 absolute nDCG@10) and multi-task models (+3.2). This is achieved by employing our routing mechanism, which surpasses other routing techniques (+1.8 on average). Furthermore, the benefit generalizes well to other datasets, even in the absence of a specific expert on the dataset. ROUTERRETRIEVER is the first work to demonstrate the advantages of routing over a mixture of domain-specific expert embedding models as an alternative to a single, general-purpose embedding model, especially when retrieving from diverse, specialized domains.
Researcher Affiliation | Collaboration | Hyunji Lee1*, Luca Soldaini2, Arman Cohan2,3, Minjoon Seo1, Kyle Lo2 (1KAIST AI, 2Allen Institute for AI, 3Yale University)
Pseudocode | Yes | Algorithm 1: Constructing Pilot Embedding Library
Open Source Code | Yes | Code: https://github.com/amy-hyunji/RouterRetriever
Open Datasets | Yes | We use the provided training and test sets in the BEIR benchmark (Thakur et al. 2021).
Dataset Splits | Yes | We use the provided training and test sets in the BEIR benchmark (Thakur et al. 2021).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory used for experiments.
Software Dependencies | No | The paper mentions using Contriever and LoRA but does not specify software versions for the libraries or programming languages used in the implementation.
Experiment Setup | Yes | For training, we adopt the few-shot hyperparameters from Izacard et al. (2021): a learning rate of 1e-4, a batch size of 256 with in-batch negatives, and a maximum of 500 epochs with early stopping.
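The routing idea summarized above (a pilot embedding library per domain expert, with queries routed to the most similar expert) can be sketched minimally as follows. This is an illustrative assumption-laden sketch, not the authors' actual implementation: the function and variable names (`route`, `pilot_library`), the use of cosine similarity, and the toy 2-d embeddings are all hypothetical stand-ins for the pilot embeddings produced by Algorithm 1.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-d embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(query_emb, pilot_library):
    """Pick the expert whose pilot embeddings best match the query.

    pilot_library maps an expert name to a list of pilot embeddings
    (illustrative stand-in for the paper's pilot embedding library).
    """
    return max(
        pilot_library,
        key=lambda name: max(cosine(query_emb, p) for p in pilot_library[name]),
    )

# Toy usage with 2-d embeddings (hypothetical domains and vectors).
pilots = {
    "science": [np.array([1.0, 0.0])],
    "finance": [np.array([0.0, 1.0])],
}
print(route(np.array([0.9, 0.1]), pilots))  # prints "science"
```

In the paper's setting, the selected expert's embedding model (a domain-specific LoRA-adapted Contriever) would then encode the query for retrieval; the sketch only shows the expert-selection step.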