Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Differentiable Reasoning over a Virtual Knowledge Base
Authors: Bhuwan Dhingra, Manzil Zaheer, Vidhisha Balachandran, Graham Neubig, Ruslan Salakhutdinov, William W. Cohen
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that DrKIT improves accuracy by 9 points on 3-hop questions in the MetaQA dataset, cutting the gap between text-based and KB-based state-of-the-art by 70%. On HotpotQA, DrKIT leads to a 10% improvement over a BERT-based re-ranking approach to retrieving the relevant passages required to answer a question. DrKIT is also very efficient, processing 10-100x more queries per second than existing multi-hop systems. |
| Researcher Affiliation | Collaboration | Bhuwan Dhingra¹, Manzil Zaheer², Vidhisha Balachandran¹, Graham Neubig¹, Ruslan Salakhutdinov¹, William W. Cohen²; ¹School of Computer Science, Carnegie Mellon University; ²Google Research |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Code available at http://www.cs.cmu.edu/~bdhingra/pages/drkit.html |
| Open Datasets | Yes | We first evaluate DrKIT on the MetaQA benchmark for multi-hop question answering (Zhang et al., 2018). MetaQA consists of around 400K questions... We used the same version of the data as Sun et al. (2019). ...For the multi-hop slot-filling experiments below, we used WikiData (Vrandečić & Krötzsch, 2014) as our KB, Wikipedia as the corpus, and SLING (Ringgaard et al., 2017) to identify entity mentions. ...HotpotQA (Yang et al., 2018) is a recent dataset of over 100K crowd-sourced multi-hop questions and answers over introductory Wikipedia passages. |
| Dataset Splits | Yes | We tuned the number of nearest neighbors K and the softmax temperature λ on the dev set of each task, and we found K = 10000 and λ = 4 to work best. ...Details of the collected WikiData dataset are shown in Table 4 (Task: #train / #dev / #test) ... 1hop: 16901 / 2467 / 10000 ... 2hop: 163607 / 398 / 9897 ... 3hop: 36061 / 453 / 9899 |
| Hardware Specification | Yes | Figure 2: Runtime on a single K80 GPU... Figure 3: Hits@1 vs Queries/sec during inference on (Left) MetaQA and (Middle) WikiData tasks, measured on a single CPU server with 6 cores. ...Table 3: #Bert refers to the number of calls to BERT (Devlin et al., 2019) in the model. s/Q denotes seconds per query (using batch size 1) for inference on a single 16-core CPU. |
| Software Dependencies | No | The paper mentions 'TensorFlow' and the 'BERT-large (Devlin et al., 2019) model' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use p = 400 dimensional embeddings for the mentions and queries, and 200-dimensional embeddings each for the start and end positions... For the first hop, we assign Z0 as a 1-hot vector for the least frequent entity detected in the question using an exact match. The number of nearest neighbors K and the softmax temperature λ were tuned on the dev set of each task, and we found K = 10000 and λ = 4 to work best. ...Other hyperparameters include batch size 32, learning rate 5 × 10⁻⁵, number of training epochs 5, and a maximum combined passage length 512. |
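
For anyone attempting a re-run, the values quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch, not the authors' code: the class name `DrKITSetup` and all field names are hypothetical, and the learning rate is read as 5e-5 from the extracted text. Only the numeric values come from the row above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DrKITSetup:
    """Hyperparameters quoted in the Experiment Setup row.

    Field names are hypothetical; only the values come from the report.
    """
    embedding_dim: int = 400           # p: mention and query embedding size
    start_dim: int = 200               # start-position embedding size
    end_dim: int = 200                 # end-position embedding size
    num_neighbors: int = 10_000        # K, tuned on the dev set of each task
    softmax_temperature: float = 4.0   # lambda, tuned on the dev set of each task
    batch_size: int = 32
    learning_rate: float = 5e-5        # assumption: "5 10 5" read as 5 x 10^-5
    num_epochs: int = 5
    max_passage_length: int = 512      # maximum combined passage length

    def validate(self) -> None:
        """Reject non-positive values; a basic sanity check only."""
        for name, value in asdict(self).items():
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")


setup = DrKITSetup()
setup.validate()
```

Because the dataclass is frozen, a variant run (for example, a different K) would be created with `dataclasses.replace(setup, num_neighbors=...)` rather than by mutating the object, which keeps the reported values intact as a baseline.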