Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval
Authors: Omar Khattab, Christopher Potts, Matei Zaharia
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Baleen on retrieval for two-hop question answering and many-hop claim verification, establishing state-of-the-art performance. We first test Baleen on the two-hop HotPotQA benchmark, finding evidence of saturation in retrieval: we achieve 96.3% answer recall in the top-20 retrieved passages, up from 89.4% for existing work. We then test Baleen's ability to scale accurately to more hops, reporting our main results using the recent many-hop HoVer task. |
| Researcher Affiliation | Academia | Omar Khattab, Stanford University, okhattab@stanford.edu; Christopher Potts, Stanford University, cgpotts@stanford.edu; Matei Zaharia, Stanford University, matei@cs.stanford.edu |
| Pseudocode | Yes | Latent hop ordering is summarized in Algorithm 1, which assumes we have already trained a single-hop (i.e., first-hop) retriever R1 in the manner of relevance-guided supervision (see §2). |
| Open Source Code | Yes | 1https://github.com/stanford-futuredata/Baleen |
| Open Datasets | Yes | We use Wikipedia, which has favorable licensing (generally under CC BY-SA 3.0), and publicly-released datasets HoVer and HotPotQA (CC BY-SA 4.0 licenses). |
| Dataset Splits | Yes | We report passage-level exact-match (EM) and Recall@k (R@k) on the development set. HoVer contains just over 18,000 training examples, about 5× smaller than HotPotQA, adding to the challenge posed by HoVer. The HotPotQA training set consists of 90,469 examples and the HoVer training set of 18,171 examples. |
| Hardware Specification | Yes | Our models are trained on 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'BERT-base (uncased) [7]', 'ELECTRA-large [5]', and 'Adam optimizer [14]', but it does not specify explicit version numbers for general software dependencies (e.g., Python, PyTorch/TensorFlow, or other libraries) required for reproduction. |
| Experiment Setup | Yes | We encode our queries and passages with BERT-base (uncased) [7], trained for 200k steps with a batch size of 128 (query, passage) pairs. We use Adam optimizer [14] with an initial learning rate of 1e-5. We train our passage encoder with a 2-stage distillation approach. In stage 1, we distill a smaller BERT-base model from a larger ELECTRA-large [5] using a batch size of 256 for 50k steps. In stage 2, we train the distilled BERT-base model for 100k steps using an effective batch size of 2048. We use ELECTRA-large [5] for the condenser and train for 200k steps with an effective batch size of 2048 and a learning rate of 1e-5 using Adam. |
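
For illustration, the hyperparameters quoted in the Experiment Setup row (BERT-base query/passage encoder, Adam with learning rate 1e-5) can be mapped onto a standard ColBERT-style late-interaction training step. The sketch below is an assumption-laden illustration, not Baleen's released implementation (see the repository linked above for that): all function names are hypothetical, and data loading, gradient accumulation to reach the reported effective batch sizes, the condenser, and the two-stage distillation are omitted.

```python
# Illustrative sketch only; not Baleen's actual training code.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-5)  # lr as reported

def embed(texts):
    # Token-level embeddings, L2-normalized for cosine-style MaxSim scoring.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state                 # [B, T, 768]
    return torch.nn.functional.normalize(hidden, dim=-1), batch["attention_mask"]

def maxsim_score(q_emb, q_mask, p_emb, p_mask):
    # Late interaction: each query token takes its best match among passage tokens.
    sim = q_emb @ p_emb.transpose(1, 2)                          # [B, Tq, Tp]
    sim = sim.masked_fill(p_mask[:, None, :] == 0, -1e4)         # ignore passage padding
    best = sim.max(dim=-1).values                                # [B, Tq]
    return (best * q_mask).sum(dim=-1)                           # ignore query padding

def training_step(queries, positives, negatives):
    # One contrastive step over (query, positive passage, negative passage) triples.
    q, q_mask = embed(queries)
    pos, pos_mask = embed(positives)
    neg, neg_mask = embed(negatives)
    scores = torch.stack([maxsim_score(q, q_mask, pos, pos_mask),
                          maxsim_score(q, q_mask, neg, neg_mask)], dim=1)
    target = torch.zeros(len(queries), dtype=torch.long)         # positive is index 0
    loss = torch.nn.functional.cross_entropy(scores, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In practice, the reported effective batch sizes (e.g., 2048) would typically be reached via gradient accumulation and data parallelism across the 8 V100 GPUs listed under Hardware Specification.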