reproducibilityindex.ai

Differentiable Reasoning over a Virtual Knowledge Base

Authors: Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, William W. Cohen

ICLR 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We show that DrKIT improves accuracy by 9 points on 3-hop questions in the MetaQA dataset, cutting the gap between text-based and KB-based state-of-the-art by 70%. On HotpotQA, DrKIT leads to a 10% improvement over a BERT-based re-ranking approach to retrieving the relevant passages required to answer a question. DrKIT is also very efficient, processing 10-100x more queries per second than existing multi-hop systems.
Researcher Affiliation	Collaboration	Bhuwan Dhingra¹, Manzil Zaheer², Vidhisha Balachandran¹, Graham Neubig¹, Ruslan Salakhutdinov¹, William W. Cohen². ¹School of Computer Science, Carnegie Mellon University; ²Google Research. {bdhingra, vbalacha, gneubig, rsalakhu}@cs.cmu.edu, {manzilzaheer, wcohen}@google.com
Pseudocode	No	No structured pseudocode or algorithm blocks were found.
Open Source Code	Yes	Code available at http://www.cs.cmu.edu/~bdhingra/pages/drkit.html
Open Datasets	Yes	We first evaluate DrKIT on the MetaQA benchmark for multi-hop question answering (Zhang et al., 2018). MetaQA consists of around 400K questions... We used the same version of the data as Sun et al. (2019). ...For the multi-hop slot-filling experiments below, we used WikiData (Vrandečić & Krötzsch, 2014) as our KB, Wikipedia as the corpus, and SLING (Ringgaard et al., 2017) to identify entity mentions. ...HotpotQA (Yang et al., 2018) is a recent dataset of over 100K crowd-sourced multi-hop questions and answers over introductory Wikipedia passages.
Dataset Splits	Yes	We tuned the number of nearest neighbors K and the softmax temperature λ on the dev set of each task, and we found K = 10000 and λ = 4 to work best. ...Details of the collected WikiData dataset are shown in Table 4. Task #train #dev #test ... 1hop 16901 2467 10000 ... 2hop 163607 398 9897 ... 3hop 36061 453 9899
Hardware Specification	Yes	Figure 2: Runtime on a single K80 GPU... Figure 3: Hits@1 vs Queries/sec during inference on (Left) MetaQA and (Middle) WikiData tasks, measured on a single CPU server with 6 cores. ...Table 3: #Bert refers to the number of calls to BERT (Devlin et al., 2019) in the model. s/Q denotes seconds per query (using batch size 1) for inference on a single 16-core CPU.
Software Dependencies	No	The paper mentions 'TensorFlow' and the 'BERT-large (Devlin et al., 2019) model' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup	Yes	We use p = 400 dimensional embeddings for the mentions and queries, and 200-dimensional embeddings each for the start and end positions... For the first hop, we assign Z0 as a 1-hot vector for the least frequent entity detected in the question using an exact match. The number of nearest neighbors K and the softmax temperature λ were tuned on the dev set of each task, and we found K = 10000 and λ = 4 to work best. ...Other hyperparameters include batch size 32, learning rate 5 × 10^-5, number of training epochs 5, and a maximum combined passage length 512.
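
For readers attempting a reproduction, the values quoted in the Experiment Setup row can be collected in one place. The block below is a minimal sketch, not the authors' code: the class name `ReportedSetup`, its field names, and the helper `initial_entity_one_hot` are hypothetical, the learning rate is read as 5 × 10^-5, and the entity ids and frequency table in the usage lines are made-up toy data. The helper illustrates only the first-hop rule described above (a 1-hot vector on the least frequent entity detected in the question).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    embedding_dim: int = 400          # p, for mentions and queries
    start_end_dim: int = 200          # each of the start and end position embeddings
    top_k: int = 10000                # number of nearest neighbors K (tuned on dev)
    softmax_temperature: float = 4.0  # lambda (tuned on dev)
    batch_size: int = 32
    learning_rate: float = 5e-5       # assumed reading of the reported "5 10 5"
    num_epochs: int = 5
    max_passage_length: int = 512     # maximum combined passage length


def initial_entity_one_hot(question_entities, entity_counts, num_entities):
    """Return a 1-hot list selecting the least frequent entity in the question.

    question_entities: integer entity ids detected in the question
        (assumed to come from an exact-match step that is not shown here).
    entity_counts: mapping from entity id to frequency (assumed precomputed).
        Ids missing from the mapping are treated as frequency 0.
    """
    if not question_entities:
        raise ValueError("no entity detected in the question")
    # min() keeps the first entity when several share the lowest frequency.
    rarest = min(question_entities, key=lambda e: entity_counts.get(e, 0))
    z0 = [0.0] * num_entities
    z0[rarest] = 1.0
    return z0


# Toy usage: entity 5 is rarer than entity 2, so it receives the 1.
setup = ReportedSetup()
z0 = initial_entity_one_hot([2, 5], {2: 120, 5: 7}, num_entities=8)
print(setup)
print(z0)
```

Keeping the reported values in a frozen dataclass makes accidental changes during a reproduction attempt fail loudly. How λ enters the softmax (multiplying or dividing the scores) is not stated in the excerpt, so the sketch does not implement it.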