Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets

Authors: Saku Sugawara, Pontus Stenetorp, Kentaro Inui, Akiko Aizawa

AAAI 2020 (pp. 8918-8927) | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 10 datasets (e.g., CoQA, SQuAD v2.0, and RACE) with a strong baseline model show that, for example, the relative scores of the baseline model provided with content words only and with shuffled sentence words in the context are on average 89.2% and 78.5% of the original scores, respectively. (A sketch of these two ablations appears after the table.)
Researcher Affiliation | Academia | Saku Sugawara (University of Tokyo, sakus@is.s.u-tokyo.ac.jp); Pontus Stenetorp (University College London, p.stenetorp@cs.ucl.ac.uk); Kentaro Inui (Tohoku University and RIKEN Center for AIP, inui@ecei.tohoku.ac.jp); Akiko Aizawa (National Institute of Informatics, aizawa@nii.ac.jp)
Pseudocode | No | The paper describes its methodology in prose and tables but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any links to, or explicit statements about, the release of source code for the described methodology.
Open Datasets | Yes | We use 10 datasets. For answer extraction datasets in which a reader chooses a text span in a given context, we use (1) CoQA (Reddy, Chen, and Manning 2019), (2) DuoRC (Saha et al. 2018), (3) HotpotQA (distractor) (Yang et al. 2018), (4) SQuAD v1.1 (Rajpurkar et al. 2016), and (5) SQuAD v2.0 (Rajpurkar, Jia, and Liang 2018). For multiple-choice datasets in which a reader chooses a correct option from multiple options, we use (6) ARC (Challenge) (Clark et al. 2018), (7) MCTest (Richardson, Burges, and Renshaw 2013), (8) MultiRC (Khashabi et al. 2018), (9) RACE (Lai et al. 2017), and (10) SWAG (Zellers et al. 2018).
Dataset Splits | Yes | For the main analysis, we applied our ablation methods to development sets. ... We fine-tuned it on the original training set of each dataset and evaluated it on a modified development set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software such as BERT-large, CoreNLP, and NLTK, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Table 6: Hyperparameters used in the experiments, where d is the size of the token sequence fed into the model, b is the training batch size, lr is the learning rate, and ep is the number of training epochs. We set the learning rate warmup to 0.05 for RACE and to 0.1 for the other datasets. We used stride = 128 for documents longer than d tokens. (See the configuration sketch after the table.)
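
The Research Type row mentions two of the paper's context-ablation methods: keeping content words only and shuffling the words within each sentence. Below is a minimal sketch of what such ablations could look like, not the authors' code; the POS-based definition of "content words" is an assumption, and NLTK is used here only because the paper mentions it among its tools.

```python
# Minimal sketch (assumptions noted above) of two context ablations:
# (1) keep content words only, (2) shuffle word order within each sentence.
import random
import nltk

# Tokenizer and POS-tagger models must be available locally.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# Assumed definition: nouns, verbs, adjectives, adverbs, and numbers count as content words.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB", "CD")


def content_words_only(context: str) -> str:
    """Drop every token whose POS tag does not start with a content-word prefix."""
    tokens = nltk.word_tokenize(context)
    tagged = nltk.pos_tag(tokens)
    kept = [tok for tok, tag in tagged if tag.startswith(CONTENT_TAG_PREFIXES)]
    return " ".join(kept)


def shuffle_sentence_words(context: str, seed: int = 0) -> str:
    """Shuffle word order independently within each sentence of the context."""
    rng = random.Random(seed)
    shuffled_sentences = []
    for sentence in nltk.sent_tokenize(context):
        words = nltk.word_tokenize(sentence)
        rng.shuffle(words)
        shuffled_sentences.append(" ".join(words))
    return " ".join(shuffled_sentences)
```

In the paper's setup, the baseline model is evaluated on contexts transformed in this way, and its score is reported relative to the score obtained on the original, unmodified contexts.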
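
The Experiment Setup row points to Table 6, which lists per-dataset fine-tuning hyperparameters (d, b, lr, ep) for the BERT-large baseline. The sketch below only illustrates how such a configuration might be organized; every numeric value is a placeholder except the warmup proportions and the document stride, which are stated in the quoted text.

```python
# Illustrative layout of a per-dataset fine-tuning configuration in the spirit
# of Table 6. Values marked "placeholder" are NOT taken from the paper.
finetune_config = {
    "race": {
        "max_seq_length": 512,      # d: token sequence length fed to the model (placeholder)
        "batch_size": 32,           # b: training batch size (placeholder)
        "learning_rate": 2e-5,      # lr (placeholder)
        "num_epochs": 3,            # ep (placeholder)
        "warmup_proportion": 0.05,  # stated in the paper for RACE
    },
    "squad_v1": {
        "max_seq_length": 384,      # placeholder
        "batch_size": 24,           # placeholder
        "learning_rate": 3e-5,      # placeholder
        "num_epochs": 2,            # placeholder
        "warmup_proportion": 0.1,   # stated in the paper for datasets other than RACE
    },
}

# For documents longer than max_seq_length tokens, the paper uses a sliding
# window over the context with a stride of 128 tokens.
DOC_STRIDE = 128
```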