The Effect of Natural Distribution Shift on Question Answering Models

Authors: John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. Across a broad range of models, we observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively." (A sketch of the token-level F1 metric behind these numbers appears after this table.)
Researcher Affiliation | Academia | "John Miller (1), Karl Krauth (1), Benjamin Recht (1), Ludwig Schmidt (1); (1) Department of Computer Science, University of California, Berkeley, Berkeley, California, USA."
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | "To enable future research, all of our new test sets are freely available online." (https://modestyachts.github.io/squadshifts-website/) This link only provides access to the datasets, not the source code for the methodology presented in the paper.
Open Datasets | Yes | "To enable future research, all of our new test sets are freely available online." (https://modestyachts.github.io/squadshifts-website/) and "Since its release in 2016, the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) has generated intense interest from the natural language processing community." (A loading sketch for these test sets appears after this table.)
Dataset Splits | Yes | "The SQuAD test set is not publicly available. Therefore, while we use public test set evaluation numbers, we use the public development set for analysis." (Table 1: SQuAD v1.1 Dev, 48 articles, 10,570 examples)
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions tools such as "spaCy (Honnibal & Montani, 2017)" but does not provide specific version numbers for the software dependencies used in the experiments.
Experiment Setup | No | "All of the models were submitted to the CodaLab platform, and we evaluate every model using the exact same configuration (model weights, hyperparameters, command-line arguments, execution environment) as the original submission." This statement refers to the configurations of previously submitted models, not to hyperparameter or training details provided in this paper for the authors' own evaluation framework.
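
For context on the reported F1 drops, the sketch below shows a token-level F1 computation in the style of the official SQuAD evaluation script. The normalization rules and function names are illustrative assumptions for this report, not the authors' exact code.

```python
# Illustrative sketch of a SQuAD-style token-level F1 metric.
# Normalization details (lowercasing, punctuation and article removal)
# mirror the official SQuAD evaluation script in spirit and are
# assumptions for this example.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 1.0 after normalization, since the leading article is stripped.
    print(f1_score("the Eiffel Tower", "Eiffel Tower"))
```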
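
The new test sets linked above are distributed as SQuAD-format JSON, so a standard SQuAD reader should apply. The snippet below is a minimal loading sketch; the file name `new_wiki_v1.0.json` is a hypothetical placeholder for whichever file is downloaded from the project website.

```python
# Minimal sketch for reading a SQuAD-format JSON file, such as one of the
# new test sets downloaded from the SQuADShifts website.
# The path below is a hypothetical placeholder, not a documented file name.
import json

PATH = "new_wiki_v1.0.json"  # assumed local download from the project site

with open(PATH, encoding="utf-8") as f:
    squad = json.load(f)

examples = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            examples.append(
                {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": [a["text"] for a in qa["answers"]],
                }
            )

print(f"Loaded {len(examples)} question-answer examples.")
```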