Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Read + Verify: Machine Reading Comprehension with Unanswerable Questions
Authors: Minghao Hu, Furu Wei, Yuxing Peng, Zhen Huang, Nan Yang, Dongsheng Li
Pages: 6529-6537
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on the SQuAD 2.0 dataset show that our system obtains a score of 74.2 F1 on test set, achieving state-of-the-art results at the time of submission (Aug. 28th, 2018). |
| Researcher Affiliation | Collaboration | (1) College of Computer, National University of Defense Technology; (2) Microsoft Research Asia |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2) but no formal pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate our approach on the SQuAD 2.0 dataset (Rajpurkar, Jia, and Liang 2018). |
| Dataset Splits | Yes | We tune this threshold to maximize F1 score on the development set, and report both of EM (Exact Match) and F1 metrics. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU models, CPU types, or memory used for the experiments. |
| Software Dependencies | No | The paper mentions using 'the nltk tokenizer' but does not provide a specific version number for nltk or any other software dependency. |
| Experiment Setup | Yes | We run a grid search on γ and λ among [0.1, 0.3, 0.5, 0.7, 1, 2]. Based on the performance on development set, we set γ as 0.3 and λ to be 1. ... For Model-II, the Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0008 is used, the hidden size is set as 300, and a dropout (Srivastava et al. 2014) of 0.3 is applied for preventing overfitting. The batch size is 48 for the reader, 64 for Model-II, and 32 for Model-I as well as Model-III. |
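
The grid search quoted in the Experiment Setup row can be illustrated with a minimal sketch. Only the candidate values `[0.1, 0.3, 0.5, 0.7, 1, 2]` and the reported choice (γ = 0.3, λ = 1) come from the quoted text. The function names `grid_search` and `toy_dev_score` are hypothetical, and the toy scoring function is an artificial stand-in built to peak at the reported values so the example runs. It is not the paper's training objective or its development-set F1.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# Candidate values quoted in the Experiment Setup row.
GRID = [0.1, 0.3, 0.5, 0.7, 1, 2]


def grid_search(
    dev_score: Callable[[float, float], float],
    values: Sequence[float] = GRID,
) -> Tuple[float, float, float]:
    """Try every (gamma, lam) pair and return the best one with its score."""
    best = None
    for gamma, lam in product(values, repeat=2):
        score = dev_score(gamma, lam)
        if best is None or score > best[2]:
            best = (gamma, lam, score)
    return best


def toy_dev_score(gamma: float, lam: float) -> float:
    # ASSUMPTION: placeholder for "train the model, then score it on the
    # development set". It is artificial and peaks at (0.3, 1) by design.
    return -((gamma - 0.3) ** 2 + (lam - 1) ** 2)


best_gamma, best_lam, best_score = grid_search(toy_dev_score)
print(best_gamma, best_lam)
```

With six candidates per hyperparameter the search covers 36 configurations. In a real run, `toy_dev_score` would be replaced by a function that trains the model and returns its development-set score.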