SciTaiL: A Textual Entailment Dataset from Science Question Answering

Authors: Tushar Khot, Ashish Sabharwal, Peter Clark

AAAI 2018

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the supporting excerpt or note from the paper.
Research Type: Experimental
"We present a new dataset and model for textual entailment, derived from treating multiple-choice question-answering as an entailment problem. SCITAIL is the first entailment set that is created solely from natural sentences that already exist independently 'in the wild' rather than sentences authored specifically for the entailment task. ... The resulting challenge is evidenced by state-of-the-art textual entailment systems achieving mediocre performance on SCITAIL, especially in comparison to a simple majority class baseline. As a step forward, we demonstrate that one can improve accuracy on SCITAIL by 5% using a new neural model that exploits linguistic structure."
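The QA-to-entailment conversion the abstract alludes to can be made concrete with a toy sketch: the question and its correct answer are turned into a declarative hypothesis, which is paired with an independently retrieved sentence as the premise. The heuristic below is my own crude simplification for illustration, not the authors' actual pipeline (which used a more careful question-to-statement conversion plus crowdsourced labels).

```python
# Illustrative sketch only: the real SciTail construction used a more
# careful question-to-statement conversion and human entailment labels.

def qa_to_hypothesis(question: str, answer: str) -> str:
    """Naively turn a (question, answer) pair into a declarative hypothesis."""
    q = question.strip().rstrip("?")
    wh_words = ("what", "which", "who", "when", "where", "why", "how")
    tokens = q.split()
    if tokens and tokens[0].lower() in wh_words:
        # Drop the wh-word and append the answer; real conversion is smarter.
        return f"{' '.join(tokens[1:])} {answer}".strip()
    return f"{q} {answer}".strip()

# A premise is a retrieved sentence; annotators then decide whether it
# entails the hypothesis ("entails") or not ("neutral").
premise = "Plants absorb carbon dioxide from the air for photosynthesis."
hypothesis = qa_to_hypothesis("Plants absorb which gas from the air?",
                              "carbon dioxide")
print(premise, "=>", hypothesis)
```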
Researcher Affiliation: Industry
"Tushar Khot, Ashish Sabharwal, Peter Clark. Allen Institute for Artificial Intelligence, Seattle, WA, U.S.A. {tushark,ashishs,peterc}@allenai.org"
Pseudocode: No
The paper does not include a clearly labeled pseudocode or algorithm block.
Open Source Code: Yes
"Our implementation is also available from the dataset page at http://data.allenai.org/scitail."
Open Datasets: Yes
"Our final released dataset is available at http://data.allenai.org/scitail/ along with the raw annotations collected for all the questions. We use multiple-choice science questions from publicly released 4th grade (204 questions) and 8th grade (195 questions) exams and the crowd-sourced questions from the SciQ dataset (2,835 questions) (Welbl, Liu, and Gardner 2017) to create Q."
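To make the dataset claim actionable, here is a minimal loader sketch. It assumes the tab-separated release format (one example per line: premise, hypothesis, label in {"entails", "neutral"}) and a local file name matching the download; both are assumptions about the distribution, not something stated in the excerpt above.

```python
from pathlib import Path

def load_scitail(path: str) -> list[dict]:
    """Load one SciTail split, assuming three tab-separated fields per line:
    premise, hypothesis, and a label in {"entails", "neutral"}."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines defensively
            premise, hypothesis, label = parts
            examples.append(
                {"premise": premise, "hypothesis": hypothesis, "label": label}
            )
    return examples

# Hypothetical local file name; adjust to wherever the release was unpacked.
train = load_scitail("scitail_1.0_train.tsv")
print(len(train), train[0]["label"])
```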
Dataset Splits: Yes
"We use the same train/dev/test splits from the original question sets so that QA systems trained on this dataset can be evaluated against the original test questions. Table 4 gives the distribution of examples and questions in our splits."
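Reusing the load_scitail helper sketched above, one can run a Table-4-style sanity check of the example and label distribution over the released splits (the split file names are again assumed):

```python
from collections import Counter

# Hypothetical file names for the three released splits.
splits = {
    "train": "scitail_1.0_train.tsv",
    "dev": "scitail_1.0_dev.tsv",
    "test": "scitail_1.0_test.tsv",
}

for name, path in splits.items():
    examples = load_scitail(path)
    labels = Counter(ex["label"] for ex in examples)
    print(f"{name}: {len(examples)} examples, label counts {dict(labels)}")
```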
Hardware Specification: No
The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies: No
"We implement our model using AllenNLP toolkit (Gardner et al. 2017) in PyTorch. We use the 300-dimensional 840B GloVe embeddings (Pennington, Socher, and Manning 2014) projected down to 100 dimensions. We used the cross-entropy loss with Adam optimization (Kingma and Ba 2015)." No specific version numbers for the software components are provided.
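The 300-to-100-dimension projection mentioned in the quote amounts to a single learned linear layer over a pretrained embedding lookup. The PyTorch sketch below is a minimal stand-in for the AllenNLP configuration actually used; the class name and the frozen-embedding choice are my assumptions.

```python
import torch
import torch.nn as nn

class ProjectedGlove(nn.Module):
    """Frozen 300-d GloVe lookup followed by a learned projection to 100-d."""

    def __init__(self, glove_weights: torch.Tensor):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix loaded from the 840B vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.project = nn.Linear(300, 100)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.embed(token_ids))

# Toy usage with random stand-in weights for a 5,000-word vocabulary.
emb = ProjectedGlove(torch.randn(5000, 300))
out = emb(torch.randint(0, 5000, (2, 12)))  # batch of 2 sequences, length 12
print(out.shape)  # torch.Size([2, 12, 100])
```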
Experiment Setup: Yes
"We use the 300-dimensional 840B GloVe embeddings (Pennington, Socher, and Manning 2014) projected down to 100 dimensions. We set the dimensionality of the hidden vectors in LSTM and MLPe as 100. We used the cross-entropy loss with Adam optimization (Kingma and Ba 2015). We halved the learning rate at every epoch and used early-stopping (patience=20) based on the validation set accuracy. We set the dropout to 0.5 and the edge embedding dimensionality to 10. We selected these parameters based on the accuracies on the validation set."
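The reported optimization recipe (cross-entropy loss, Adam, learning rate halved every epoch, early stopping with patience 20 on validation accuracy) maps onto standard PyTorch components. The loop below is a sketch wiring them together around a hypothetical model and data loaders; it is not the paper's AllenNLP configuration, and the stated dropout of 0.5 would live inside the model definition rather than this loop.

```python
import torch
import torch.nn as nn

def train(model, train_loader, dev_loader, max_epochs=200, patience=20):
    """Sketch of the reported recipe: Adam + cross-entropy, learning rate
    halved every epoch, early stopping on validation accuracy."""
    optimizer = torch.optim.Adam(model.parameters())
    # step_size=1 with gamma=0.5 halves the learning rate after each epoch.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    loss_fn = nn.CrossEntropyLoss()

    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # Evaluate on the validation set for early stopping.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for inputs, labels in dev_loader:
                correct += (model(inputs).argmax(dim=1) == labels).sum().item()
                total += labels.size(0)
        acc = correct / total
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_acc
```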