Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets

Authors: Saku Sugawara, Pontus Stenetorp, Kentaro Inui, Akiko Aizawa

AAAI 2020 (pp. 8918-8927) | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 10 datasets (e.g., CoQA, SQuAD v2.0, and RACE) with a strong baseline model show that, for example, the relative scores of the baseline model provided with content words only and with shuffled sentence words in the context are on average 89.2% and 78.5% of the original scores, respectively. (A sketch of these two ablations appears after the table.)
Researcher Affiliation | Academia | Saku Sugawara (University of Tokyo, sakus@is.s.u-tokyo.ac.jp); Pontus Stenetorp (University College London, p.stenetorp@cs.ucl.ac.uk); Kentaro Inui (Tohoku University and RIKEN Center for AIP, inui@ecei.tohoku.ac.jp); Akiko Aizawa (National Institute of Informatics, aizawa@nii.ac.jp)
Pseudocode | No | The paper describes its methodology in prose and tables but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any links to, or explicit statements about, the release of source code for the described methodology.
Open Datasets | Yes | We use 10 datasets. For answer extraction datasets in which a reader chooses a text span in a given context, we use (1) CoQA (Reddy, Chen, and Manning 2019), (2) DuoRC (Saha et al. 2018), (3) HotpotQA (distractor) (Yang et al. 2018), (4) SQuAD v1.1 (Rajpurkar et al. 2016), and (5) SQuAD v2.0 (Rajpurkar, Jia, and Liang 2018). For multiple-choice datasets in which a reader chooses a correct option from multiple options, we use (6) ARC (Challenge) (Clark et al. 2018), (7) MCTest (Richardson, Burges, and Renshaw 2013), (8) MultiRC (Khashabi et al. 2018), (9) RACE (Lai et al. 2017), and (10) SWAG (Zellers et al. 2018).
Dataset Splits | Yes | For the main analysis, we applied our ablation methods to development sets. ... We fine-tuned it on the original training set of each dataset and evaluated it on a modified development set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software such as BERT-large, CoreNLP, and NLTK, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Table 6: Hyperparameters used in the experiments, where d is the size of the token sequence fed into the model, b is the training batch size, lr is the learning rate, and ep is the number of training epochs. We set the learning rate warmup to 0.05 for RACE and to 0.1 for the other datasets. We used stride = 128 for documents longer than d tokens. (See the configuration sketch after the table.)
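
The Research Type row mentions two of the paper's context-ablation methods: keeping content words only and shuffling the words within each sentence. Below is a minimal sketch of what such ablations could look like, not the authors' code; the POS-based definition of "content words" is an assumption, and NLTK is used here only because the paper mentions it among its tools.

```python
# Minimal sketch (assumptions noted above) of two context ablations:
# (1) keep content words only, (2) shuffle word order within each sentence.
import random
import nltk

# Tokenizer and POS-tagger models must be available locally.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Assumed definition: nouns, verbs, adjectives, adverbs, and numbers count as content words.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB", "CD")


def content_words_only(context: str) -> str:
    """Drop every token whose POS tag does not start with a content-word prefix."""
    tokens = nltk.word_tokenize(context)
    tagged = nltk.pos_tag(tokens)
    kept = [tok for tok, tag in tagged if tag.startswith(CONTENT_TAG_PREFIXES)]
    return " ".join(kept)


def shuffle_sentence_words(context: str, seed: int = 0) -> str:
    """Shuffle word order independently within each sentence of the context."""
    rng = random.Random(seed)
    shuffled_sentences = []
    for sentence in nltk.sent_tokenize(context):
        words = nltk.word_tokenize(sentence)
        rng.shuffle(words)
        shuffled_sentences.append(" ".join(words))
    return " ".join(shuffled_sentences)
```

In the paper's setup, the baseline model is evaluated on contexts transformed in this way, and its score is reported relative to the score obtained on the original, unmodified contexts.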
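
The Experiment Setup row points to Table 6, which lists per-dataset fine-tuning hyperparameters (d, b, lr, ep) for the BERT-large baseline. The sketch below only illustrates how such a configuration might be organized; every numeric value is a placeholder except the warmup proportions and the document stride, which are stated in the quoted text.

```python
# Illustrative layout of a per-dataset fine-tuning configuration in the spirit
# of Table 6. Values marked "placeholder" are NOT taken from the paper.
finetune_config = {
    "race": {
        "max_seq_length": 512,      # d: token sequence length fed to the model (placeholder)
        "batch_size": 32,           # b: training batch size (placeholder)
        "learning_rate": 2e-5,      # lr (placeholder)
        "num_epochs": 3,            # ep (placeholder)
        "warmup_proportion": 0.05,  # stated in the paper for RACE
    },
    "squad_v1": {
        "max_seq_length": 384,      # placeholder
        "batch_size": 24,           # placeholder
        "learning_rate": 3e-5,      # placeholder
        "num_epochs": 2,            # placeholder
        "warmup_proportion": 0.1,   # stated in the paper for datasets other than RACE
    },
}

# For documents longer than max_seq_length tokens, the paper uses a sliding
# window over the context with a stride of 128 tokens.
DOC_STRIDE = 128
```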