Semantics Altering Modifications for Evaluating Comprehension in Machine Reading

Authors: Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro
Pages: 13762-13770

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a large-scale empirical study, we apply the methodology in order to evaluate extractive MRC models with regard to their capability to correctly process SAM-enriched data. We comprehensively cover 12 different state-of-the-art neural architecture configurations and four training datasets and find that, despite their well-known remarkable performance, optimised models consistently struggle to correctly process semantically altered data.
Researcher Affiliation | Academia | Viktor Schlegel, Goran Nenadic and Riza Batista-Navarro, Department of Computer Science, University of Manchester, Manchester, United Kingdom. {viktor.schlegel, gnenadic, riza.batista}@manchester.ac.uk
Pseudocode | No | The paper describes its generative process with a diagram (Figure 2) but does not provide any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and Supplementary Materials SM1, SM2 and SM3 can be retrieved from https://github.com/schlevik/sam
Open Datasets | Yes | SQUAD (Rajpurkar et al. 2016) is a widely studied dataset where the human baseline is surpassed by the state of the art. HOTPOTQA (Yang et al. 2018) in the distractor setting requires information synthesis from multiple passages in the context connected by a common entity or its property. DROP (Dua et al. 2019) requires performing simple arithmetical tasks in order to predict the correct answer. NEWSQA (Trischler et al. 2017) contains questions that were created without having access to the provided context. (A hedged loading sketch for these public datasets follows the table.)
Dataset Splits | Yes | When fine-tuning MRC models on our generated data, we separate the seed templates into two distinct sets, in order to ensure that the models do not perform well by just memorising the templates. These template sets are used to generate a training (12,000 instances) and an evaluation (2,400 instances) set with aligned baseline, intervention and control instances. (...) Table 1: DICE and EM/F1 score on the corresponding development sets of the evaluated models. (A sketch of the standard EM/F1 computation follows the table.)
Hardware Specification | No | The paper mentions: "The computation-heavy aspects of this paper were made possible due to access to the Computational Shared Facility at The University of Manchester, for which the authors are grateful." However, this does not provide specific hardware details such as GPU or CPU models.
Software Dependencies | No | The paper mentions various models (BERT, RoBERTa, ALBERT, T5, BiDAF) but does not specify the versions of the software, libraries, or frameworks used for their implementation or experiments.
Experiment Setup | No | The paper describes general aspects of model architecture and training (e.g., "concatenate the question and context, and optimise the parameters... to minimise the cross-entropy loss") but does not provide specific hyperparameters such as learning rate, batch size, or number of epochs. (A hedged fine-tuning sketch follows the table.)
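
The four datasets listed under Open Datasets are all publicly available. Below is a minimal, hedged sketch of loading them with the Hugging Face `datasets` library; the dataset identifiers (`squad`, `hotpot_qa` with the `distractor` config, `drop`) are the usual Hub names and are an assumption on our part, not identifiers stated in the paper.

```python
# Sketch: loading the public MRC datasets named in the paper via the
# Hugging Face `datasets` library. Dataset identifiers are assumed Hub names.
from datasets import load_dataset

squad = load_dataset("squad")                       # Rajpurkar et al. 2016
hotpotqa = load_dataset("hotpot_qa", "distractor")  # Yang et al. 2018, distractor setting
drop = load_dataset("drop")                         # Dua et al. 2019

# NewsQA (Trischler et al. 2017) requires a manual download of its source
# articles, so it cannot be pulled with a single load_dataset call here.

print(squad["train"][0]["question"])
```
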
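Table 1 of the paper reports EM/F1 alongside the paper-specific DICE score. The snippet below is a sketch of the conventional SQuAD-style exact-match and token-level F1 computation only; it is not code from the authors' repository, and it does not cover DICE, whose definition is specific to the paper.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM is 1.0 only when the normalised strings are identical."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 over the overlap between predicted and gold answer tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Toy example: partial overlap gives F1 = 0.5 while EM is 0.
print(exact_match("the red car", "red car"))  # 1.0 (normalisation removes "the")
print(f1_score("a red bicycle", "red car"))   # 0.5
```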
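The Experiment Setup row quotes the paper's description of concatenating the question and context and minimising a cross-entropy loss over answer spans. The sketch below illustrates that recipe with the Hugging Face `transformers` library; the model name, toy question/context, and gold span indices are illustrative placeholders, since the paper reports no hyperparameters or training code details.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative model choice; the paper evaluates several architectures
# (BERT, RoBERTa, ALBERT, T5, BiDAF) without reporting hyperparameters.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Who proposed SAM?"  # toy example, not taken from the paper
context = "Semantics-altering modifications were proposed by Schlegel et al."

# The question and context are concatenated into a single input sequence.
inputs = tokenizer(question, context, return_tensors="pt")

# Supplying gold start/end token indices makes the model return the
# cross-entropy loss over answer-span boundaries, which is then minimised.
gold_start = torch.tensor([9])   # placeholder token indices for the toy span
gold_end = torch.tensor([11])
outputs = model(**inputs, start_positions=gold_start, end_positions=gold_end)

loss = outputs.loss   # cross-entropy over start and end positions
loss.backward()       # an optimiser step (e.g. AdamW) would follow in training
print(float(loss))
```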