QASC: A Dataset for Question Answering via Sentence Composition

Authors: Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, Ashish Sabharwal

AAAI 2020, pp. 8082-8090

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags 20% behind human performance." Table 5: QASC scores for previous state-of-the-art models on multi-hop science MCQ (OBQA), and BERT models with different corpora, retrieval approaches, and additional fine-tuning. |
| Researcher Affiliation | Collaboration | Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen+, Ashish Sabharwal. Allen Institute for AI, Seattle, WA, U.S.A.; +University of Arizona, Tucson, AZ, U.S.A. {tushark, peterc, michalg, ashishs}@allenai.org, pajansen@email.arizona.edu |
| Pseudocode | No | The paper describes its methods verbally but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | "Questions, annotated facts, and corpora are available at https://github.com/allenai/qasc." The link provides the dataset and annotated facts, but not the source code for the described methods (e.g., the retrieval approach or adversarial choice selection). |
| Open Datasets | Yes | "We propose a novel dataset, Question Answering via Sentence Composition (QASC; pronounced kask), of 9,980 multi-hop multiple-choice questions (MCQs) where simple syntactic cues are insufficient to determine how to decompose the question into simpler queries." Questions, annotated facts, and corpora are available at https://github.com/allenai/qasc. |
| Dataset Splits | Yes | "To enable fine-tuning models, we split the questions into 5962/825/873 train/dev/test folds, resp." (See the loading sketch below the table.) |
| Hardware Specification | No | The paper states "Computations on beaker.org were supported in part by credits from Google Cloud" but does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions software such as BERT, spaCy, langdetect, and ftfy, but does not provide version numbers for these components or libraries. |
| Experiment Setup | No | "For consistency, we use the same hyper-parameter sweep in all finetuning experiments (cf. Appendix D)." The paper defers hyper-parameter details to an appendix that is not included in the analyzed text. (An illustrative fine-tuning sketch follows below.) |
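
To make the dataset and split rows concrete, the sketch below loads the released QASC question files and checks the reported 5962/825/873 counts. This is a minimal sketch, not the authors' code: the file names (train.jsonl, dev.jsonl, test.jsonl) and the field names used in the inspection step are assumptions based on the ARC-style JSONL format documented in the allenai/qasc repository, so verify them against your download.

```python
import json

# File names assumed from the allenai/qasc GitHub release
# (https://github.com/allenai/qasc); adjust paths to your local download.
SPLITS = {"train": "train.jsonl", "dev": "dev.jsonl", "test": "test.jsonl"}

def load_jsonl(path):
    """Read one question per line from an ARC-style JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

for name, path in SPLITS.items():
    questions = load_jsonl(path)
    print(f"{name}: {len(questions)} questions")  # expect 5962 / 825 / 873

# Inspect one training example. The field names below (question stem,
# choices, answerKey, fact1/fact2 annotations) are assumed from the
# repository's documentation; the test fold may omit answerKey.
q = load_jsonl(SPLITS["train"])[0]
print(q["question"]["stem"])
for choice in q["question"]["choices"]:
    print(f'  ({choice["label"]}) {choice["text"]}')
print("gold:", q.get("answerKey"), "| facts:", q.get("fact1"), "+", q.get("fact2"))
```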
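Since Appendix D is unavailable, the exact sweep cannot be reproduced from this text. As a rough stand-in for the Experiment Setup row, here is a generic BERT multiple-choice fine-tuning step using Hugging Face transformers' BertForMultipleChoice. This is not the authors' pipeline: the toy question and its four choices are placeholders (QASC questions have eight answer options), and the values one would sweep (learning rate, epochs, batch size) must come from the paper's appendix.

```python
# Illustrative only: shows one forward/loss computation for BERT-style
# multiple-choice QA; all data below is a placeholder, not from QASC.
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

question = "What can trigger an immune response?"            # toy example
choices = ["transplanted organs", "sound", "sunlight", "water"]

# Encode (question, choice) pairs, then shape inputs as
# (batch_size, num_choices, seq_len) as BertForMultipleChoice expects.
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
labels = torch.tensor([0])  # index of the correct choice in the batch

outputs = model(**inputs, labels=labels)
print("loss:", outputs.loss.item(),
      "| predicted choice:", outputs.logits.argmax(-1).item())
```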