Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On the Generation of Medical Question-Answer Pairs

Authors: Sheng Shen, Yaliang Li, Nan Du, Xian Wu, Yusheng Xie, Shen Ge, Tao Yang, Kai Wang, Xingzheng Liang, Wei Fan

Pages: 8822-8829

AAAI 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	A series of experiments have been conducted on a real-world dataset collected from the National Medical Licensing Examination of China. Both automatic evaluation and human annotation demonstrate the effectiveness of the proposed method. Further investigation shows that, by incorporating the generated QA pairs for training, significant improvement in terms of accuracy can be achieved for the examination QA system.
Researcher Affiliation	Collaboration	Sheng Shen (1), Yaliang Li (2), Nan Du (3), Xian Wu (3), Yusheng Xie (3), Shen Ge (3), Tao Yang (3), Kai Wang (3), Xingzheng Liang (3), Wei Fan (3). Affiliations: (1) University of California at Berkeley, (2) Alibaba Group, (3) Tencent.
Pseudocode	No	The paper describes its methods textually in the 'Methodology' section but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	No	The paper states, 'Our full version paper with supplemented material is publicly available at https://arxiv.org/abs/1811.00681,' but this link is to the paper itself, not a code repository. There is no explicit statement about releasing source code for the described methodology.
Open Datasets	No	The paper states, 'We collect real-world medical QA pairs from the National Medical Licensing Examination of China (denoted as NMLEC QA).' It mentions a medical entity dictionary built from 'medical Wikipedia-style pages' with a footnoted URL (xywy.com), but this is not a dataset repository, and the NMLEC QA dataset was collected by the authors, who provide no way to access it.
Dataset Splits	No	The paper mentions using 'NMLEC 2017 as the test set,' but does not provide specific percentages or counts for training, validation, and test splits, nor does it refer to standard predefined splits for the dataset collected.
Hardware Specification	No	The paper does not provide any specific details regarding the hardware used for conducting the experiments, such as GPU or CPU models.
Software Dependencies	No	The paper mentions 'Elasticsearch' and a 'Bi-LSTM-CRF model' but does not provide specific version numbers for these or any other software components, which are necessary for a reproducible setup.
Experiment Setup	No	The paper describes the model architecture and evaluation metrics but does not provide specific details about the experimental setup such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings.