Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Boosting Search Engines with Interactive Agents

Authors: Leonard Adolphs, Benjamin Börschinger, Christian Buck, Michelle Chen Huebscher, Massimiliano Ciaramita, Lasse Espeholt, Thomas Hofmann, Yannic Kilcher, Sascha Rothe, Pier Giuseppe Sessa, Lierni Sestorain

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We run experiments on an open-domain question answering task, Open QA (Lee et al., 2019). Search agents learn diverse policies leading to deep, effective explorations of the search results. The MuZero agent outperforms a BM25 (Robertson & Zaragoza, 2009) search function running over a Wikipedia index, on both retrieval and answer quality metrics. This result provides novel evidence for the potential of knowledge-infused RL in hard NLU tasks. The T5 agent can more easily leverage large pre-trained encoder-decoders and proves superior to MuZero. Furthermore, a straightforward ensemble of agents is comparable in performance to the current reference neural retrieval system, DPR (Karpukhin et al., 2020), while relying solely on interpretable, symbolic retrieval operations. This suggests new challenges for future work, e.g., involving hybrid architectures and policy synthesis. We open-source the code and trained checkpoints for both agents.
Researcher Affiliation | Collaboration | Leonard Adolphs (EMAIL), Benjamin Boerschinger (EMAIL), Christian Buck (EMAIL), Michelle Chen Huebscher (EMAIL), Massimiliano Ciaramita (EMAIL), Lasse Espeholt (EMAIL), Thomas Hofmann (EMAIL), Yannic Kilcher (EMAIL), Sascha Rothe (EMAIL), Pier Giuseppe Sessa (EMAIL), Lierni Sestorain Saralegui (EMAIL). Affiliations: ETH Zürich; Google Research.
Pseudocode | Yes | The search for Rocchio sessions is done heuristically. Full implementation details, with pseudo-code illustrating the procedure and examples, can be found in §5, Appendix A, and Appendix G; cf. also Table A.10.
Open Source Code | Yes | We open-source the code and trained checkpoints for both agents: https://github.com/google-research/google-research/tree/master/muzero and https://github.com/google-research/language/tree/master/language/search_agents
Open Datasets | Yes | For our experiments we use the Open QA-NQ dataset (Lee et al., 2019). This data is derived from Natural Questions (Kwiatkowski et al., 2019) and consists of Google queries paired with answers extracted from Wikipedia by human annotators. The data includes 79,168 train questions, 8,757 dev questions, and 3,610 test questions. We use the provided partitions and Wikipedia dump.
Dataset Splits | Yes | The data includes 79,168 train questions, 8,757 dev questions, and 3,610 test questions. We use the provided partitions and Wikipedia dump.
Hardware Specification | Yes | The MuZero agent learner, which performs both inference and training, runs on a Cloud TPU v2 with 8 cores, roughly equivalent to 10 Nvidia P100 GPUs in terms of TFLOPS. One core is allocated for training and 7 cores are allocated for inference. We use 500 CPU-based actors, along with 80 actors dedicated to evaluation. For the T5 agent we start from the pretrained T5-11B (11 billion parameters) public checkpoint and continue training on the NQ Train Rocchio expansions. Training took about 5 days using 16 Cloud TPU v3 devices.
Software Dependencies | No | The paper mentions Lucene's implementation, BERT, T5, and the SEED RL framework, but does not provide specific version numbers for any of these software components.
Experiment Setup | Yes | We attempt at most M = 100 possible refinements for each grammar operator, using terms from the constructed dictionaries Σ∩t and Σ∪t (see Eq. 3). For instance, for the + operator we attempt refinements of the form +(field:term), where term is taken from the top-M terms in the intersection dictionary Σ∩t and field represents the field (content or title) where the term was found. Dictionaries Σ∩t and Σ∪t are constructed (cf. §2.3) from the set Σt of the top N = 100 terms present in the documents retrieved so far, sorted by Lucene's IDF score. For each such possible refinement we issue the corresponding query to Lucene and, based on the returned documents, evaluate the resulting score. We use the scoring function of Eq. 7 with coefficients λ1 = 0.2, λ2 = 0.6, chosen by a search against the final quality metrics (cf. Appendix C). We then select the refinement with the highest score and discard the others. This process continues until no score-improving refinement can be found, for a maximum of 20 refinement steps.
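The greedy refinement loop quoted above can be sketched as follows. This is a minimal illustration of the control flow only: `candidate_refinements` and `toy_score` are hypothetical stand-ins for the paper's grammar operators and its Lucene-backed scoring function (Eq. 7), not the authors' implementation.

```python
# Sketch of a greedy query-refinement loop: at each step, evaluate up to M
# candidate refinements, keep the best-scoring one, and stop when no
# refinement improves the score or MAX_STEPS is reached.

MAX_STEPS = 20  # maximum number of refinement steps (as in the paper)
M = 100         # refinements attempted per grammar operator (as in the paper)

def candidate_refinements(query, dictionary, m=M):
    """Hypothetical candidate generator: one '+(content:term)' refinement
    per dictionary term, capped at the top-m terms."""
    for term in dictionary[:m]:
        yield f"{query} +(content:{term})"

def greedy_refine(query, dictionary, score_query):
    """Greedily apply the best-scoring refinement until no improvement."""
    best_score = score_query(query)
    for _ in range(MAX_STEPS):
        # Score every candidate refinement of the current query.
        candidates = [(score_query(q), q)
                      for q in candidate_refinements(query, dictionary)]
        if not candidates:
            break
        top_score, top_query = max(candidates)
        if top_score <= best_score:  # no score-improving refinement: stop
            break
        best_score, query = top_score, top_query
    return query, best_score

# Toy scoring function standing in for Eq. 7: counts desired terms present.
def toy_score(q):
    return sum(term in q for term in ("wikipedia", "answer"))

refined, final_score = greedy_refine(
    "who wrote hamlet", ["wikipedia", "answer", "noise"], toy_score)
```

With the toy scorer, the loop adds the two rewarded terms over two steps and then halts, since every further refinement ties the current score.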