QASC: A Dataset for Question Answering via Sentence Composition

Authors: Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, Ashish Sabharwal

AAAI 2020, pp. 8082-8090

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags 20% behind human performance." Table 5: QASC scores for previous state-of-the-art models on multi-hop science MCQ (OBQA), and BERT models with different corpora, retrieval approaches, and additional fine-tuning. |
| Researcher Affiliation | Collaboration | Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen+, Ashish Sabharwal. Allen Institute for AI, Seattle, WA, U.S.A.; +University of Arizona, Tucson, AZ, U.S.A. {tushark, peterc, michalg, ashishs}@allenai.org, pajansen@email.arizona.edu |
| Pseudocode | No | The paper describes its methods verbally but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | "Questions, annotated facts, and corpora are available at https://github.com/allenai/qasc." The link provides the dataset and annotated facts, but not the source code for the described methods (e.g., the retrieval approach or adversarial choice selection). |
| Open Datasets | Yes | "We propose a novel dataset, Question Answering via Sentence Composition (QASC; pronounced kask), of 9,980 multi-hop multiple-choice questions (MCQs) where simple syntactic cues are insufficient to determine how to decompose the question into simpler queries." Questions, annotated facts, and corpora are available at https://github.com/allenai/qasc. |
| Dataset Splits | Yes | "To enable fine-tuning models, we split the questions into 5962/825/873 train/dev/test folds, resp." (See the loading sketch below the table.) |
| Hardware Specification | No | The paper states "Computations on beaker.org were supported in part by credits from Google Cloud" but does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions software such as BERT, spaCy, langdetect, and ftfy, but does not provide version numbers for these components or libraries. |
| Experiment Setup | No | "For consistency, we use the same hyper-parameter sweep in all finetuning experiments (cf. Appendix D)." The paper defers hyper-parameter details to an appendix that is not included in the analyzed text. (An illustrative fine-tuning sketch follows below.) |
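
To make the dataset and split rows concrete, the sketch below loads the released QASC question files and checks the reported 5962/825/873 counts. This is a minimal sketch, not the authors' code: the file names (train.jsonl, dev.jsonl, test.jsonl) and the field names used in the inspection step are assumptions based on the ARC-style JSONL format documented in the allenai/qasc repository, so verify them against your download.

```python
import json

# File names assumed from the allenai/qasc GitHub release
# (https://github.com/allenai/qasc); adjust paths to your local download.
SPLITS = {"train": "train.jsonl", "dev": "dev.jsonl", "test": "test.jsonl"}

def load_jsonl(path):
    """Read one question per line from an ARC-style JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

for name, path in SPLITS.items():
    questions = load_jsonl(path)
    print(f"{name}: {len(questions)} questions")  # expect 5962 / 825 / 873

# Inspect one training example. The field names below (question stem,
# choices, answerKey, fact1/fact2 annotations) are assumed from the
# repository's documentation; the test fold may omit answerKey.
q = load_jsonl(SPLITS["train"])[0]
print(q["question"]["stem"])
for choice in q["question"]["choices"]:
    print(f'  ({choice["label"]}) {choice["text"]}')
print("gold:", q.get("answerKey"), "| facts:", q.get("fact1"), "+", q.get("fact2"))
```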
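Since Appendix D is unavailable, the exact sweep cannot be reproduced from this text. As a rough stand-in for the Experiment Setup row, here is a generic BERT multiple-choice fine-tuning step using Hugging Face transformers' BertForMultipleChoice. This is not the authors' pipeline: the toy question and its four choices are placeholders (QASC questions have eight answer options), and the values one would sweep (learning rate, epochs, batch size) must come from the paper's appendix.

```python
# Illustrative only: shows one forward/loss computation for BERT-style
# multiple-choice QA; all data below is a placeholder, not from QASC.
import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

question = "What can trigger an immune response?"            # toy example
choices = ["transplanted organs", "sound", "sunlight", "water"]

# Encode (question, choice) pairs, then shape inputs as
# (batch_size, num_choices, seq_len) as BertForMultipleChoice expects.
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
labels = torch.tensor([0])  # index of the correct choice in the batch

outputs = model(**inputs, labels=labels)
print("loss:", outputs.loss.item(),
      "| predicted choice:", outputs.logits.argmax(-1).item())
```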