Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval
Authors: Omar Khattab, Christopher Potts, Matei Zaharia
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Baleen on retrieval for two-hop question answering and many-hop claim verification, establishing state-of-the-art performance. We first test Baleen on the two-hop HotPotQA benchmark, finding evidence of saturation in retrieval: we achieve 96.3% answer recall in the top-20 retrieved passages, up from 89.4% for existing work. We then test Baleen's ability to scale accurately to more hops, reporting our main results using the recent many-hop HoVer task. |
| Researcher Affiliation | Academia | Omar Khattab, Stanford University, okhattab@stanford.edu; Christopher Potts, Stanford University, cgpotts@stanford.edu; Matei Zaharia, Stanford University, matei@cs.stanford.edu |
| Pseudocode | Yes | Latent hop ordering is summarized in Algorithm 1, which assumes we have already trained a single-hop (i.e., first-hop) retriever R1 in the manner of relevance-guided supervision (see §2). |
| Open Source Code | Yes | 1https://github.com/stanford-futuredata/Baleen |
| Open Datasets | Yes | We use Wikipedia, which has favorable licensing (generally under CC BY-SA 3.0), and publicly-released datasets HoVer and HotPotQA (CC BY-SA 4.0 licenses). |
| Dataset Splits | Yes | We report passage-level exact-match (EM) and Recall@k (R@k) on the development set. HoVer contains just over 18,000 training examples, about 5× smaller than HotPotQA, adding to the challenge posed by HoVer. The HotPotQA training set consists of 90,469 examples and the HoVer training set of 18,171 examples. |
| Hardware Specification | Yes | Our models are trained on 8 V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'BERT-base (uncased) [7]', 'ELECTRA-large [5]', and 'Adam optimizer [14]', but it does not specify explicit version numbers for general software dependencies (e.g., Python, PyTorch/TensorFlow, or other libraries) required for reproduction. |
| Experiment Setup | Yes | We encode our queries and passages with BERT-base (uncased) [7], trained for 200k steps with a batch size of 128 (query, passage) pairs. We use Adam optimizer [14] with an initial learning rate of 1e-5. We train our passage encoder with a 2-stage distillation approach. In stage 1, we distill a smaller BERT-base model from a larger ELECTRA-large [5] using a batch size of 256 for 50k steps. In stage 2, we train the distilled BERT-base model for 100k steps using an effective batch size of 2048. We use ELECTRA-large [5] for the condenser and train for 200k steps with an effective batch size of 2048 and a learning rate of 1e-5 using Adam. |
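
For illustration, the hyperparameters quoted in the Experiment Setup row (BERT-base query/passage encoder, Adam with learning rate 1e-5) can be mapped onto a standard ColBERT-style late-interaction training step. The sketch below is an assumption-laden illustration, not Baleen's released implementation (see the repository linked above for that): all function names are hypothetical, and data loading, gradient accumulation to reach the reported effective batch sizes, the condenser, and the two-stage distillation are omitted.

```python
# Illustrative sketch only; not Baleen's actual training code.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-5)  # lr as reported

def embed(texts):
    # Token-level embeddings, L2-normalized for cosine-style MaxSim scoring.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state                 # [B, T, 768]
    return torch.nn.functional.normalize(hidden, dim=-1), batch["attention_mask"]

def maxsim_score(q_emb, q_mask, p_emb, p_mask):
    # Late interaction: each query token takes its best match among passage tokens.
    sim = q_emb @ p_emb.transpose(1, 2)                          # [B, Tq, Tp]
    sim = sim.masked_fill(p_mask[:, None, :] == 0, -1e4)         # ignore passage padding
    best = sim.max(dim=-1).values                                # [B, Tq]
    return (best * q_mask).sum(dim=-1)                           # ignore query padding

def training_step(queries, positives, negatives):
    # One contrastive step over (query, positive passage, negative passage) triples.
    q, q_mask = embed(queries)
    pos, pos_mask = embed(positives)
    neg, neg_mask = embed(negatives)
    scores = torch.stack([maxsim_score(q, q_mask, pos, pos_mask),
                          maxsim_score(q, q_mask, neg, neg_mask)], dim=1)
    target = torch.zeros(len(queries), dtype=torch.long)         # positive is index 0
    loss = torch.nn.functional.cross_entropy(scores, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In practice, the reported effective batch sizes (e.g., 2048) would typically be reached via gradient accumulation and data parallelism across the 8 V100 GPUs listed under Hardware Specification.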