Which Shortcut Solution Do Question Answering Models Prefer to Learn?
Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Behavioral tests using biased training sets reveal that shortcuts that exploit answer positions and word-label correlations are preferentially learned for extractive and multiple-choice QA, respectively. We find that the more learnable a shortcut is, the flatter and deeper the loss landscape is around the shortcut solution in the parameter space. We also find that the availability of the preferred shortcuts tends to make the task easier to perform from an information-theoretic viewpoint. Lastly, we experimentally show that the learnability of shortcuts can be utilized to construct an effective QA training set; the more learnable a shortcut is, the smaller the proportion of anti-shortcut examples required to achieve comparable performance on shortcut and anti-shortcut examples. |
| Researcher Affiliation | Academia | Kazutoshi Shinoda (1,2), Saku Sugawara (2), Akiko Aizawa (1,2) — 1: The University of Tokyo; 2: National Institute of Informatics. shinoda@is.s.u-tokyo.ac.jp, {saku, aizawa}@nii.ac.jp |
| Pseudocode | No | The paper describes its methods in prose and includes mathematical equations, but it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes are publicly available at https://github.com/KazutoshiShinoda/ShortcutLearnability. |
| Open Datasets | Yes | For extractive QA, we used SQuAD 1.1 (Rajpurkar et al. 2016) and Natural Questions (Kwiatkowski et al. 2019), which contain more than one thousand examples in the biased training sets in Figure 1. For multiple-choice QA, we used RACE (Lai et al. 2017) and ReClor (Yu et al. 2020), where option-only models can perform better than the random baselines (Sugawara et al. 2020; Yu et al. 2020), suggesting that options in these datasets have unintended biases. |
| Dataset Splits | No | The paper mentions 'training sets' and 'evaluation sets' and describes experiments on varying proportions of anti-shortcut examples within training sets, but it does not specify explicit numerical splits for training, validation, and test sets (e.g., 80/10/10) or provide details for reproducing the data partitioning into these distinct sets. |
| Hardware Specification | No | The paper mentions the use of BERT-base and RoBERTa-base models, but it does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'BERT-base' and 'RoBERTa-base' as encoders and 'spaCy' for named entity recognition, but it does not specify version numbers for these or any other software components, which are needed to reproduce the software environment. |
| Experiment Setup | No | The paper states, 'Except for the training steps, we followed the hyperparameters suggested by the original papers,' which delegates the specific setup details to external sources rather than explicitly providing them within the paper. It does not list specific hyperparameter values or comprehensive training configurations. |
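The paper's central experimental manipulation, as summarized above, is constructing training sets in which a controlled proportion of examples are "anti-shortcut" (unsolvable by the shortcut). A minimal sketch of that sampling step is below; the function name and the dictionary-based example format are illustrative assumptions, not taken from the authors' released code.

```python
import random

def build_biased_training_set(shortcut_pool, anti_pool, n_total,
                              anti_ratio, seed=0):
    """Sample a training set of size n_total in which a fraction
    anti_ratio of examples are anti-shortcut, mirroring the paper's
    controlled-bias setup. Pools are lists of QA examples; the
    function signature is a hypothetical sketch.
    """
    rng = random.Random(seed)
    n_anti = round(n_total * anti_ratio)       # anti-shortcut count
    n_shortcut = n_total - n_anti              # shortcut-solvable count
    sampled = (rng.sample(shortcut_pool, n_shortcut)
               + rng.sample(anti_pool, n_anti))
    rng.shuffle(sampled)                       # avoid ordering bias
    return sampled
```

Sweeping `anti_ratio` from 0 toward 0.5 is how one would probe the paper's finding that more learnable shortcuts need a smaller proportion of anti-shortcut examples to reach comparable performance on both example types.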