Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
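As a rough illustration of the validation step the notice describes, the sketch below computes per-variable agreement between LLM-produced labels and manual labels. The function name, variable names, and toy labels are hypothetical; the actual metrics and methodology are those described in [1].

```python
# Hypothetical sketch of validating an LLM-based classifier against a
# manually labeled dataset. Names and example labels are illustrative,
# not taken from [1].

def per_variable_accuracy(llm_labels, manual_labels):
    """Compute agreement between LLM and manual labels for each
    reproducibility variable (e.g. 'Open Source Code')."""
    accuracy = {}
    for variable in manual_labels:
        pairs = list(zip(llm_labels[variable], manual_labels[variable]))
        correct = sum(1 for llm, manual in pairs if llm == manual)
        accuracy[variable] = correct / len(pairs)
    return accuracy

# Toy example: two variables, four manually labeled papers each.
llm = {"Open Source Code": ["Yes", "No", "Yes", "Yes"],
       "Open Datasets":    ["Yes", "Yes", "No", "No"]}
manual = {"Open Source Code": ["Yes", "No", "No", "Yes"],
          "Open Datasets":    ["Yes", "Yes", "No", "No"]}

print(per_variable_accuracy(llm, manual))
# {'Open Source Code': 0.75, 'Open Datasets': 1.0}
```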

An Empirical Study of Content Understanding in Conversational Question Answering

Authors: Ting-Rui Chiang, Hao-Tong Ye, Yun-Nung Chen (pp. 7578-7585)

AAAI 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results indicate some potential hazards in the benchmark datasets, QuAC and CoQA, for conversational comprehension research. Our analysis also sheds light on both what models may learn and how datasets may bias the models.
Researcher Affiliation | Academia | Ting-Rui Chiang, Hao-Tong Ye, Yun-Nung Chen; National Taiwan University, Taipei, Taiwan; EMAIL, EMAIL
Pseudocode | No | The paper describes the models and experimental settings in text, but it does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at https://github.com/MiuLab/CQA-Study.
Open Datasets | Yes | There are two benchmark conversational question answering datasets, QuAC (Choi et al. 2018) and CoQA (Reddy, Chen, and Manning 2019).
Dataset Splits | Yes | Model performance is reported on the validation sets of QuAC and CoQA.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running the experiments.
Software Dependencies | No | The paper mentions utilizing existing models such as FlowQA, BERT, and SDNet, but it does not specify any software dependencies with version numbers (e.g., Python version, library versions such as PyTorch or TensorFlow) required for replication.
Experiment Setup | Yes | Each model in each setting is trained with 3 different random seeds, and the resulting mean and standard deviation are reported for reliability.
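The seed-averaging described in the Experiment Setup row can be sketched as follows. The scores are invented placeholders, not results from the paper; only the procedure (3 seeds, report mean and standard deviation) comes from the quoted text.

```python
# Minimal sketch (hypothetical scores, not from the paper) of reporting
# mean and standard deviation over 3 random-seed runs of one setting.
import statistics

# Hypothetical F1 scores from the same model/setting under 3 seeds.
seed_scores = [64.2, 65.1, 63.8]

mean = statistics.mean(seed_scores)
std = statistics.stdev(seed_scores)  # sample standard deviation

print(f"{mean:.2f} +/- {std:.2f}")
# 64.37 +/- 0.67
```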