Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Boosting Search Engines with Interactive Agents

Authors: Leonard Adolphs, Benjamin Börschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano Ciaramita, Lasse Espeholt, Thomas Hofmann, Yannic Kilcher, Sascha Rothe, Pier Giuseppe Sessa, Lierni Sestorain

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments on an open-domain question answering task, Open QA (Lee et al., 2019). Search agents learn diverse policies leading to deep, effective explorations of the search results. The MuZero agent outperforms a BM25 (Robertson & Zaragoza, 2009) search function running over a Wikipedia index, on both retrieval and answer quality metrics. This result provides novel evidence for the potential of knowledge-infused RL in hard NLU tasks. The T5 agent can more easily leverage large pre-trained encoder-decoders and proves superior to MuZero. Furthermore, a straightforward ensemble of agents is comparable in performance to the current reference neural retrieval system, DPR (Karpukhin et al., 2020), while relying solely on interpretable, symbolic retrieval operations. This suggests new challenges for future work, e.g., involving hybrid architectures and policy synthesis. We open-source the code and trained checkpoints for both agents.
Researcher Affiliation | Collaboration | Leonard Adolphs (EMAIL), Benjamin Boerschinger (EMAIL), Christian Buck (EMAIL), Michelle Chen Huebscher (EMAIL), Massimiliano Ciaramita (EMAIL), Lasse Espeholt (EMAIL), Thomas Hofmann (EMAIL), Yannic Kilcher (EMAIL), Sascha Rothe (EMAIL), Pier Giuseppe Sessa (EMAIL), Lierni Sestorain Saralegui (EMAIL). Affiliations: ETH Zürich; Google Research.
Pseudocode | Yes | The search for Rocchio sessions is done heuristically. Full implementation details, with pseudo-code illustrating the procedure and examples, can be found in §5, Appendix A, and Appendix G; cf. also Table A.10.
Open Source Code | Yes | We open-source the code and trained checkpoints for both agents: https://github.com/google-research/google-research/tree/master/muzero and https://github.com/google-research/language/tree/master/language/search_agents
Open Datasets | Yes | For our experiments we use the Open QA-NQ dataset (Lee et al., 2019). This data is derived from Natural Questions (Kwiatkowski et al., 2019) and consists of Google queries paired with answers extracted from Wikipedia by human annotators. The data includes 79,168 train questions, 8,757 dev questions, and 3,610 test questions. We use the provided partitions and Wikipedia dump.
Dataset Splits | Yes | The data includes 79,168 train questions, 8,757 dev questions, and 3,610 test questions. We use the provided partitions and Wikipedia dump.
Hardware Specification | Yes | The MuZero agent learner, which performs both inference and training, runs on a Cloud TPU v2 with 8 cores, roughly equivalent to 10 Nvidia P100 GPUs in terms of TFLOPS. One core is allocated for training and 7 cores are allocated for inference. We use 500 CPU-based actors, along with 80 actors dedicated to evaluation. For the T5 agent we start from the pretrained T5-11B (11 billion parameters) public checkpoint and continue training on the NQ Train Rocchio expansions. Training took about 5 days using 16 Cloud TPU v3 devices.
Software Dependencies | No | The paper mentions Lucene's implementation, BERT, T5, and the SEED RL framework, but does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | We attempt at most M = 100 possible refinements for each grammar operator, using terms from the constructed dictionaries Σ∩t and Σ∪t (see Eq. 3). For instance, for the + operator we attempt refinements of the form +(field:term), where term is taken from the top-M terms in the intersection dictionary Σ∩t and field represents the field (content or title) where the term was found. Dictionaries Σ∩t and Σ∪t are constructed (cf. §2.3) from the set Σt of the top N = 100 terms present in the documents retrieved so far, sorted by Lucene's IDF score. For each such possible refinement we issue the corresponding query to Lucene and, based on the returned documents, evaluate the resulting score. We use the scoring function of Eq. 7 with coefficients λ1 = 0.2, λ2 = 0.6, chosen by a search against the final quality metrics (cf. Appendix C). We then select the refinement with the highest score and discard the others. This process continues until no score-improving refinement can be found, for a maximum of 20 refinement steps.
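The greedy refinement loop quoted above can be sketched as follows. This is a minimal illustration of the control flow only: `candidate_refinements` and `toy_score` are hypothetical stand-ins for the paper's grammar operators and its Lucene-backed scoring function (Eq. 7), not the authors' implementation.

```python
# Sketch of a greedy query-refinement loop: at each step, evaluate up to M
# candidate refinements, keep the best-scoring one, and stop when no
# refinement improves the score or MAX_STEPS is reached.

MAX_STEPS = 20  # maximum number of refinement steps (as in the paper)
M = 100         # refinements attempted per grammar operator (as in the paper)

def candidate_refinements(query, dictionary, m=M):
    """Hypothetical candidate generator: one '+(content:term)' refinement
    per dictionary term, capped at the top-m terms."""
    for term in dictionary[:m]:
        yield f"{query} +(content:{term})"

def greedy_refine(query, dictionary, score_query):
    """Greedily apply the best-scoring refinement until no improvement."""
    best_score = score_query(query)
    for _ in range(MAX_STEPS):
        # Score every candidate refinement of the current query.
        candidates = [(score_query(q), q)
                      for q in candidate_refinements(query, dictionary)]
        if not candidates:
            break
        top_score, top_query = max(candidates)
        if top_score <= best_score:  # no score-improving refinement: stop
            break
        best_score, query = top_score, top_query
    return query, best_score

# Toy scoring function standing in for Eq. 7: counts desired terms present.
def toy_score(q):
    return sum(term in q for term in ("wikipedia", "answer"))

refined, final_score = greedy_refine(
    "who wrote hamlet", ["wikipedia", "answer", "noise"], toy_score)
```

With the toy scorer, the loop adds the two rewarded terms over two steps and then halts, since every further refinement ties the current score.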