The Effect of Natural Distribution Shift on Question Answering Models
Authors: John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. Across a broad range of models, we observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively [on the New York Times, Reddit, and Amazon test sets; the SQuAD F1 metric is sketched after this table]. |
| Researcher Affiliation | Academia | John Miller, Karl Krauth, Benjamin Recht, Ludwig Schmidt (Department of Computer Science, University of California, Berkeley, Berkeley, California, USA). |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | To enable future research, all of our new test sets are freely available online (https://modestyachts.github.io/squadshifts-website/). This only provides access to the datasets, not the source code for any methodology presented in the paper. |
| Open Datasets | Yes | To enable future research, all of our new test sets are freely available online (https://modestyachts.github.io/squadshifts-website/). Also: Since its release in 2016, the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) has generated intense interest from the natural language processing community. (A loading sketch follows this table.) |
| Dataset Splits | Yes | The SQuAD test set is not publicly available. Therefore, while we use public test set evaluation numbers, we use the public development set for analysis. (Table 1: SQuAD v1.1 Dev, 48 articles, 10,570 examples) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions tools like 'spaCy (Honnibal & Montani, 2017)' but does not provide specific version numbers for the software dependencies used in the experiments. |
| Experiment Setup | No | All of the models were submitted to the CodaLab platform, and we evaluate every model using the exact same configuration (model weights, hyperparameters, command-line arguments, execution environment) as the original submission. This statement refers to the configurations of previously submitted models; the paper itself does not report hyperparameter or training-setup details for its own evaluation framework. |
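
The "Open Datasets" row points to the SQuADShifts download page. As a minimal sketch, assuming the four test sets are also mirrored on the Hugging Face Hub under the dataset name `squadshifts` with configs `new_wiki`, `nyt`, `reddit`, and `amazon` (the canonical download remains https://modestyachts.github.io/squadshifts-website/), loading them could look like:

```python
# Minimal sketch: load the four SQuADShifts test sets.
# The Hub dataset name "squadshifts" and its config names are assumptions;
# the canonical download is https://modestyachts.github.io/squadshifts-website/
from datasets import load_dataset

CONFIGS = ["new_wiki", "nyt", "reddit", "amazon"]

# Each config exposes a single "test" split in SQuAD v1.1 format:
# fields id, title, context, question, answers {text, answer_start}.
test_sets = {name: load_dataset("squadshifts", name, split="test") for name in CONFIGS}

for name, ds in test_sets.items():
    print(f"{name}: {len(ds)} examples")
```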
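
The F1 drops quoted in the "Research Type" row use the token-overlap metric from the official SQuAD evaluation script. Below is a self-contained sketch of that metric, not the authors' own code:

```python
# Sketch of the standard SQuAD F1 metric: token-level overlap between a
# predicted answer and the best-matching reference answer.
import collections
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace,
    mirroring the normalization in the official SQuAD evaluation script."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between one prediction and one reference answer."""
    pred = normalize(prediction).split()
    truth = normalize(ground_truth).split()
    common = collections.Counter(pred) & collections.Counter(truth)
    num_same = sum(common.values())
    if num_same == 0:  # also covers an empty prediction or reference
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(truth)
    return 2 * precision * recall / (precision + recall)


def squad_f1(prediction: str, reference_answers: list[str]) -> float:
    """SQuAD scoring takes the maximum F1 over all reference answers."""
    return max(f1_score(prediction, ans) for ans in reference_answers)
```

Averaging `squad_f1` over a SQuADShifts split and subtracting the same model's average on the original SQuAD development set yields the per-domain F1 drops reported above.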