The Effect of Natural Distribution Shift on Question Answering Models

Authors: John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. Across a broad range of models, we observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively." (A sketch of the token-level F1 metric behind these numbers appears after this table.)
Researcher Affiliation | Academia | "John Miller (1), Karl Krauth (1), Benjamin Recht (1), Ludwig Schmidt (1); (1) Department of Computer Science, University of California, Berkeley, Berkeley, California, USA."
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | "To enable future research, all of our new test sets are freely available online." (https://modestyachts.github.io/squadshifts-website/) This link only provides access to the datasets, not the source code for the methodology presented in the paper.
Open Datasets | Yes | "To enable future research, all of our new test sets are freely available online." (https://modestyachts.github.io/squadshifts-website/) and "Since its release in 2016, the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) has generated intense interest from the natural language processing community." (A loading sketch for these test sets appears after this table.)
Dataset Splits | Yes | "The SQuAD test set is not publicly available. Therefore, while we use public test set evaluation numbers, we use the public development set for analysis." (Table 1: SQuAD v1.1 Dev, 48 articles, 10,570 examples)
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions tools such as "spaCy (Honnibal & Montani, 2017)" but does not provide specific version numbers for the software dependencies used in the experiments.
Experiment Setup | No | "All of the models were submitted to the CodaLab platform, and we evaluate every model using the exact same configuration (model weights, hyperparameters, command-line arguments, execution environment) as the original submission." This statement refers to the configurations of previously submitted models, not to hyperparameter or training details provided in this paper for the authors' own evaluation framework.
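
For context on the reported F1 drops, the sketch below shows a token-level F1 computation in the style of the official SQuAD evaluation script. The normalization rules and function names are illustrative assumptions for this report, not the authors' exact code.

```python
# Illustrative sketch of a SQuAD-style token-level F1 metric.
# Normalization details (lowercasing, punctuation and article removal)
# mirror the official SQuAD evaluation script in spirit and are
# assumptions for this example.
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and one gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 1.0 after normalization, since the leading article is stripped.
    print(f1_score("the Eiffel Tower", "Eiffel Tower"))
```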
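
The new test sets linked above are distributed as SQuAD-format JSON, so a standard SQuAD reader should apply. The snippet below is a minimal loading sketch; the file name `new_wiki_v1.0.json` is a hypothetical placeholder for whichever file is downloaded from the project website.

```python
# Minimal sketch for reading a SQuAD-format JSON file, such as one of the
# new test sets downloaded from the SQuADShifts website.
# The path below is a hypothetical placeholder, not a documented file name.
import json

PATH = "new_wiki_v1.0.json"  # assumed local download from the project site

with open(PATH, encoding="utf-8") as f:
    squad = json.load(f)

examples = []
for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            examples.append(
                {
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": context,
                    "answers": [a["text"] for a in qa["answers"]],
                }
            )

print(f"Loaded {len(examples)} question-answer examples.")
```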