Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
Authors: Saku Sugawara, Pontus Stenetorp, Kentaro Inui, Akiko Aizawa
AAAI 2020, pp. 8918-8927 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on 10 datasets (e.g., CoQA, SQuAD v2.0, and RACE) with a strong baseline model show that, for example, the relative scores of the baseline model provided with content words only and with shuffled sentence words in the context are on average 89.2% and 78.5% of the original scores, respectively. (A minimal sketch of these two context ablations appears after this table.) |
| Researcher Affiliation | Academia | Saku Sugawara, University of Tokyo (sakus@is.s.u-tokyo.ac.jp); Pontus Stenetorp, University College London (p.stenetorp@cs.ucl.ac.uk); Kentaro Inui, Tohoku University and RIKEN Center for AIP (inui@ecei.tohoku.ac.jp); Akiko Aizawa, National Institute of Informatics (aizawa@nii.ac.jp) |
| Pseudocode | No | The paper describes methodologies in prose and tables but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any links to or explicit statements about the release of its source code for the described methodology. |
| Open Datasets | Yes | We use 10 datasets. For answer extraction datasets in which a reader chooses a text span in a given context, we use (1) CoQA (Reddy, Chen, and Manning 2019), (2) DuoRC (Saha et al. 2018), (3) HotpotQA (distractor) (Yang et al. 2018), (4) SQuAD v1.1 (Rajpurkar et al. 2016), and (5) SQuAD v2.0 (Rajpurkar, Jia, and Liang 2018). For multiple choice datasets in which a reader chooses a correct option from multiple options, we use (6) ARC (Challenge) (Clark et al. 2018), (7) MCTest (Richardson, Burges, and Renshaw 2013), (8) MultiRC (Khashabi et al. 2018), (9) RACE (Lai et al. 2017), and (10) SWAG (Zellers et al. 2018). |
| Dataset Splits | Yes | For the main analysis, we applied our ablation methods to development sets. ... We fine-tuned it on the original training set of each dataset and evaluated it on a modified development set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as BERT-large, CoreNLP, and NLTK, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Table 6: Hyperparameters used in the experiments, where d is the size of the token sequence fed into the model, b is the training batch size, lr is the learning rate, and ep is the number of training epochs. We set the learning-rate warmup proportion to 0.05 for RACE and 0.1 for the other datasets. We used stride = 128 for documents longer than d tokens. (A hedged configuration sketch follows this table.) |
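
To make the two context ablations in the Research Type row concrete, here is a minimal sketch of how "content words only" and "shuffled sentence words" could be applied to a passage, assuming NLTK (which the paper mentions) for tokenization and POS tagging. The function names, the content-word tag whitelist, and the relative-score helper are illustrative assumptions, not the authors' released code (none was released).

```python
# Minimal sketch (not the authors' code) of the two context ablations,
# implemented with NLTK. Requires the "punkt" and
# "averaged_perceptron_tagger" NLTK data packages.
import random

from nltk import pos_tag, sent_tokenize, word_tokenize

# Assumption: treat nouns, verbs, adjectives, adverbs, and numerals as content words.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB", "CD")


def content_words_only(context: str) -> str:
    """Drop function words, keeping only content words in their original order."""
    tagged = pos_tag(word_tokenize(context))
    return " ".join(tok for tok, tag in tagged if tag.startswith(CONTENT_TAG_PREFIXES))


def shuffle_words_within_sentences(context: str, seed: int = 0) -> str:
    """Randomly reorder the words inside each sentence of the context."""
    rng = random.Random(seed)
    shuffled_sentences = []
    for sentence in sent_tokenize(context):
        tokens = word_tokenize(sentence)
        rng.shuffle(tokens)
        shuffled_sentences.append(" ".join(tokens))
    return " ".join(shuffled_sentences)


def relative_score(ablated: float, original: float) -> float:
    """Ablated score as a percentage of the original score (e.g., 89.2 or 78.5)."""
    return 100.0 * ablated / original
```

In this reading, the baseline model is evaluated on a development set rewritten by one of these functions, and the resulting score is reported relative to the score on the unmodified development set.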
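
The notation in the Experiment Setup row can likewise be read as the following configuration skeleton. The field names map directly onto d, b, lr, and ep; the numeric values in the example instantiation are placeholders only, since the actual per-dataset values appear in the paper's Table 6 and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FinetuneConfig:
    """Sketch of BERT-large fine-tuning settings, mirroring Table 6 notation."""
    max_seq_length: int        # d: size of the token sequence fed into the model
    batch_size: int            # b: training batch size
    learning_rate: float       # lr
    num_epochs: int            # ep
    warmup_proportion: float   # 0.05 for RACE, 0.1 for the other datasets
    doc_stride: int = 128      # sliding-window stride for documents longer than d tokens


# Placeholder values for illustration only; consult Table 6 for the real ones.
example_config = FinetuneConfig(
    max_seq_length=384,
    batch_size=24,
    learning_rate=3e-5,
    num_epochs=2,
    warmup_proportion=0.1,
)
```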