Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
Authors: Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, Caiming Xiong
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show state-of-the-art results in three open-domain QA datasets, showcasing the effectiveness and robustness of our method. Notably, our method achieves significant improvement in HotpotQA, outperforming the previous best model by more than 14 points. |
| Researcher Affiliation | Collaboration | University of Washington; Salesforce Research; Allen Institute for Artificial Intelligence |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code and data are available at https://github.com/AkariAsai/learning_to_retrieve_reasoning_paths. |
| Open Datasets | Yes | We evaluate our method in three open-domain Wikipedia-sourced datasets: HotpotQA, SQuAD Open and Natural Questions Open. |
| Dataset Splits | Yes | The HotpotQA training, development, and test datasets contain 90,564, 7,405 and 7,405 questions, respectively. |
| Hardware Specification | No | The paper states, 'our retriever can be handled on a single GPU machine,' but does not specify any exact GPU model, CPU model, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'pytorch-transformers' and 'PyTorch' as software used, and 'Adam optimizer' for optimization, but specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | To train our recurrent retriever, we set the learning rate to 3 x 10^-5, and the maximum number of the training epochs to three. The mini-batch size is four; a mini-batch example consists of a question with its corresponding paragraphs. To train our reader model, we set the learning rate to 3 x 10^-5, and the maximum number of training epochs to two. Empirically we observe better performance with a larger batch size as discussed in previous work (Liu et al., 2019; Ott et al., 2018), and thus we set the mini-batch size to 120. |
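
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The snippet below is illustrative only and is not taken from the authors' repository: the names `TrainConfig`, `RETRIEVER`, `READER`, and `max_optimizer_steps` are hypothetical, and the step count assumes one training example per HotpotQA training question with no gradient accumulation, which the paper excerpt does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    learning_rate: float
    max_epochs: int
    batch_size: int


# Values as reported in the Experiment Setup row.
RETRIEVER = TrainConfig(learning_rate=3e-5, max_epochs=3, batch_size=4)
READER = TrainConfig(learning_rate=3e-5, max_epochs=2, batch_size=120)

# HotpotQA training set size from the Dataset Splits row.
HOTPOTQA_TRAIN_QUESTIONS = 90_564


def max_optimizer_steps(cfg: TrainConfig, num_examples: int) -> int:
    """Upper bound on optimizer steps, assuming one example per question
    and no gradient accumulation (both are assumptions, not paper facts)."""
    steps_per_epoch = math.ceil(num_examples / cfg.batch_size)
    return steps_per_epoch * cfg.max_epochs


if __name__ == "__main__":
    print("retriever:", max_optimizer_steps(RETRIEVER, HOTPOTQA_TRAIN_QUESTIONS))
    print("reader:", max_optimizer_steps(READER, HOTPOTQA_TRAIN_QUESTIONS))
```

Under these assumptions the reader takes far fewer optimizer steps than the retriever because of its larger mini-batch; the real counts depend on how the released code constructs training examples.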