Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
HopRetriever: Retrieve Hops over Wikipedia to Answer Complex Questions
Authors: Shaobo Li, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Chengjie Sun, Zhenzhou Ji, Bingquan Liu
Pages: 13279-13287
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the HotpotQA dataset demonstrate that HopRetriever outperforms previously published evidence retrieval methods by large margins. |
| Researcher Affiliation | Collaboration | Shaobo Li (1), Xiaoguang Li (2), Lifeng Shang (2), Xin Jiang (2), Qun Liu (2), Chengjie Sun (1), Zhenzhou Ji (1), Bingquan Liu (1); (1) Harbin Institute of Technology, (2) Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes its methods in narrative text and figures, but does not include structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | The paper does not provide a direct link to a source code repository, nor does it explicitly state that the code for the methodology is released or available. |
| Open Datasets | Yes | HopRetriever is evaluated on the multi-hop question answering dataset HotpotQA (Yang et al. 2018), which includes 90,564 question-answer pairs with annotated supporting documents and sentences for training, 7,405 question-answer pairs for development, and 7,405 questions for testing. |
| Dataset Splits | Yes | HotpotQA (Yang et al. 2018), which includes 90,564 question-answer pairs with annotated supporting documents and sentences for training, 7,405 question-answer pairs for development, and 7,405 questions for testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using BERT and BERT-base, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We restrict the maximum input sequence length of BERT to 384. In training, the batch size is set to 16, the learning rate is 3 × 10⁻⁵, and the number of training epochs is 3. We use beam search with beam size set to 8 at the inference time. |
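
For anyone attempting a reproduction, the reported setup and split sizes can be collected into a small configuration object. The following is a minimal sketch, not the authors' code (none is released): the names `ExperimentSetup`, `HOTPOTQA_SPLITS`, and `steps_per_epoch` are hypothetical, and the step count assumes one training example per question-answer pair, which the paper excerpt above does not confirm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    max_seq_length: int = 384     # maximum BERT input sequence length
    train_batch_size: int = 16
    learning_rate: float = 3e-5
    num_train_epochs: int = 3
    beam_size: int = 8            # beam search width at inference time


# Split sizes quoted in the Dataset Splits row.
HOTPOTQA_SPLITS = {"train": 90_564, "dev": 7_405, "test": 7_405}


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps per epoch, keeping the final partial batch.

    Assumption: each question-answer pair is one training example.
    """
    return math.ceil(num_examples / batch_size)


if __name__ == "__main__":
    setup = ExperimentSetup()
    per_epoch = steps_per_epoch(HOTPOTQA_SPLITS["train"], setup.train_batch_size)
    print("steps per epoch:", per_epoch)
    print("total steps:", per_epoch * setup.num_train_epochs)
```

Hardware and software versions are not reported, so a configuration like this covers only the hyperparameters; library versions and the random seed would still need to be chosen and recorded by whoever reruns the experiments.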