Which Shortcut Solution Do Question Answering Models Prefer to Learn?
Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Behavioral tests using biased training sets reveal that shortcuts that exploit answer positions and word-label correlations are preferentially learned for extractive and multiple-choice QA, respectively. We find that the more learnable a shortcut is, the flatter and deeper the loss landscape is around the shortcut solution in the parameter space. We also find that the availability of the preferred shortcuts tends to make the task easier to perform from an information-theoretic viewpoint. Lastly, we experimentally show that the learnability of shortcuts can be utilized to construct an effective QA training set; the more learnable a shortcut is, the smaller the proportion of anti-shortcut examples required to achieve comparable performance on shortcut and anti-shortcut examples. |
| Researcher Affiliation | Academia | Kazutoshi Shinoda (1,2), Saku Sugawara (2), Akiko Aizawa (1,2) — 1: The University of Tokyo; 2: National Institute of Informatics. shinoda@is.s.u-tokyo.ac.jp, {saku, aizawa}@nii.ac.jp |
| Pseudocode | No | The paper describes its methods in prose and includes mathematical equations, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are publicly available at https://github.com/KazutoshiShinoda/ShortcutLearnability. |
| Open Datasets | Yes | For extractive QA, we used SQuAD 1.1 (Rajpurkar et al. 2016) and Natural Questions (Kwiatkowski et al. 2019), which contain more than one thousand examples in the biased training sets in Figure 1. For multiple-choice QA, we used RACE (Lai et al. 2017) and ReClor (Yu et al. 2020), where option-only models can perform better than the random baselines (Sugawara et al. 2020; Yu et al. 2020), suggesting that options in these datasets have unintended biases. |
| Dataset Splits | No | The paper mentions 'training sets' and 'evaluation sets' and describes experiments on varying proportions of anti-shortcut examples within training sets, but it does not specify explicit numerical splits for training, validation, and test sets (e.g., 80/10/10) or provide details for reproducing the data partitioning into these distinct sets. |
| Hardware Specification | No | The paper mentions the use of BERT-base and RoBERTa-base models, but it does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'BERT-base' and 'RoBERTa-base' as encoders and 'spaCy' for named entity recognition, but it does not specify version numbers for these or any other software components, which are needed to reproduce the software environment. |
| Experiment Setup | No | The paper states, 'Except for the training steps, we followed the hyperparameters suggested by the original papers,' which delegates the specific setup details to external sources rather than explicitly providing them within the paper. It does not list specific hyperparameter values or comprehensive training configurations. |
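The paper's central experimental manipulation, as summarized above, is constructing training sets in which a controlled proportion of examples are "anti-shortcut" (unsolvable by the shortcut). A minimal sketch of that sampling step is below; the function name and the dictionary-based example format are illustrative assumptions, not taken from the authors' released code.

```python
import random

def build_biased_training_set(shortcut_pool, anti_pool, n_total,
                              anti_ratio, seed=0):
    """Sample a training set of size n_total in which a fraction
    anti_ratio of examples are anti-shortcut, mirroring the paper's
    controlled-bias setup. Pools are lists of QA examples; the
    function signature is a hypothetical sketch.
    """
    rng = random.Random(seed)
    n_anti = round(n_total * anti_ratio)       # anti-shortcut count
    n_shortcut = n_total - n_anti              # shortcut-solvable count
    sampled = (rng.sample(shortcut_pool, n_shortcut)
               + rng.sample(anti_pool, n_anti))
    rng.shuffle(sampled)                       # avoid ordering bias
    return sampled
```

Sweeping `anti_ratio` from 0 toward 0.5 is how one would probe the paper's finding that more learnable shortcuts need a smaller proportion of anti-shortcut examples to reach comparable performance on both example types.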