SciTaiL: A Textual Entailment Dataset from Science Question Answering

Authors: Tushar Khot, Ashish Sabharwal, Peter Clark

AAAI 2018

Reproducibility assessment. Each entry below gives the variable, the assessed result, and the supporting excerpt or note from the paper.
Research Type: Experimental
"We present a new dataset and model for textual entailment, derived from treating multiple-choice question-answering as an entailment problem. SCITAIL is the first entailment set that is created solely from natural sentences that already exist independently 'in the wild' rather than sentences authored specifically for the entailment task. ... The resulting challenge is evidenced by state-of-the-art textual entailment systems achieving mediocre performance on SCITAIL, especially in comparison to a simple majority class baseline. As a step forward, we demonstrate that one can improve accuracy on SCITAIL by 5% using a new neural model that exploits linguistic structure."
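The QA-to-entailment conversion the abstract alludes to can be made concrete with a toy sketch: the question and its correct answer are turned into a declarative hypothesis, which is paired with an independently retrieved sentence as the premise. The heuristic below is my own crude simplification for illustration, not the authors' actual pipeline (which used a more careful question-to-statement conversion plus crowdsourced labels).

```python
# Illustrative sketch only: the real SciTail construction used a more
# careful question-to-statement conversion and human entailment labels.

def qa_to_hypothesis(question: str, answer: str) -> str:
    """Naively turn a (question, answer) pair into a declarative hypothesis."""
    q = question.strip().rstrip("?")
    wh_words = ("what", "which", "who", "when", "where", "why", "how")
    tokens = q.split()
    if tokens and tokens[0].lower() in wh_words:
        # Drop the wh-word and append the answer; real conversion is smarter.
        return f"{' '.join(tokens[1:])} {answer}".strip()
    return f"{q} {answer}".strip()

# A premise is a retrieved sentence; annotators then decide whether it
# entails the hypothesis ("entails") or not ("neutral").
premise = "Plants absorb carbon dioxide from the air for photosynthesis."
hypothesis = qa_to_hypothesis("Plants absorb which gas from the air?",
                              "carbon dioxide")
print(premise, "=>", hypothesis)
```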
Researcher Affiliation: Industry
"Tushar Khot, Ashish Sabharwal, Peter Clark. Allen Institute for Artificial Intelligence, Seattle, WA, U.S.A. {tushark,ashishs,peterc}@allenai.org"
Pseudocode: No
The paper does not include a clearly labeled pseudocode or algorithm block.
Open Source Code: Yes
"Our implementation is also available from the dataset page at http://data.allenai.org/scitail."
Open Datasets: Yes
"Our final released dataset is available at http://data.allenai.org/scitail/ along with the raw annotations collected for all the questions. We use multiple-choice science questions from publicly released 4th grade (204 questions) and 8th grade (195 questions) exams and the crowd-sourced questions from the SciQ dataset (2,835 questions) (Welbl, Liu, and Gardner 2017) to create Q."
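To make the dataset claim actionable, here is a minimal loader sketch. It assumes the tab-separated release format (one example per line: premise, hypothesis, label in {"entails", "neutral"}) and a local file name matching the download; both are assumptions about the distribution, not something stated in the excerpt above.

```python
from pathlib import Path

def load_scitail(path: str) -> list[dict]:
    """Load one SciTail split, assuming three tab-separated fields per line:
    premise, hypothesis, and a label in {"entails", "neutral"}."""
    examples = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines defensively
            premise, hypothesis, label = parts
            examples.append(
                {"premise": premise, "hypothesis": hypothesis, "label": label}
            )
    return examples

# Hypothetical local file name; adjust to wherever the release was unpacked.
train = load_scitail("scitail_1.0_train.tsv")
print(len(train), train[0]["label"])
```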
Dataset Splits: Yes
"We use the same train/dev/test splits from the original question sets so that QA systems trained on this dataset can be evaluated against the original test questions. Table 4 gives the distribution of examples and questions in our splits."
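Reusing the load_scitail helper sketched above, one can run a Table-4-style sanity check of the example and label distribution over the released splits (the split file names are again assumed):

```python
from collections import Counter

# Hypothetical file names for the three released splits.
splits = {
    "train": "scitail_1.0_train.tsv",
    "dev": "scitail_1.0_dev.tsv",
    "test": "scitail_1.0_test.tsv",
}

for name, path in splits.items():
    examples = load_scitail(path)
    labels = Counter(ex["label"] for ex in examples)
    print(f"{name}: {len(examples)} examples, label counts {dict(labels)}")
```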
Hardware Specification: No
The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies: No
"We implement our model using AllenNLP toolkit (Gardner et al. 2017) in PyTorch. We use the 300-dimensional 840B GloVe embeddings (Pennington, Socher, and Manning 2014) projected down to 100 dimensions. We used the cross-entropy loss with Adam optimization (Kingma and Ba 2015)." No specific version numbers for the software components are provided.
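The 300-to-100-dimension projection mentioned in the quote amounts to a single learned linear layer over a pretrained embedding lookup. The PyTorch sketch below is a minimal stand-in for the AllenNLP configuration actually used; the class name and the frozen-embedding choice are my assumptions.

```python
import torch
import torch.nn as nn

class ProjectedGlove(nn.Module):
    """Frozen 300-d GloVe lookup followed by a learned projection to 100-d."""

    def __init__(self, glove_weights: torch.Tensor):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix loaded from the 840B vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.project = nn.Linear(300, 100)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.embed(token_ids))

# Toy usage with random stand-in weights for a 5,000-word vocabulary.
emb = ProjectedGlove(torch.randn(5000, 300))
out = emb(torch.randint(0, 5000, (2, 12)))  # batch of 2 sequences, length 12
print(out.shape)  # torch.Size([2, 12, 100])
```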
Experiment Setup: Yes
"We use the 300-dimensional 840B GloVe embeddings (Pennington, Socher, and Manning 2014) projected down to 100 dimensions. We set the dimensionality of the hidden vectors in LSTM and MLPe as 100. We used the cross-entropy loss with Adam optimization (Kingma and Ba 2015). We halved the learning rate at every epoch and used early-stopping (patience=20) based on the validation set accuracy. We set the dropout to 0.5 and the edge embedding dimensionality to 10. We selected these parameters based on the accuracies on the validation set."
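The reported optimization recipe (cross-entropy loss, Adam, learning rate halved every epoch, early stopping with patience 20 on validation accuracy) maps onto standard PyTorch components. The loop below is a sketch wiring them together around a hypothetical model and data loaders; it is not the paper's AllenNLP configuration, and the stated dropout of 0.5 would live inside the model definition rather than this loop.

```python
import torch
import torch.nn as nn

def train(model, train_loader, dev_loader, max_epochs=200, patience=20):
    """Sketch of the reported recipe: Adam + cross-entropy, learning rate
    halved every epoch, early stopping on validation accuracy."""
    optimizer = torch.optim.Adam(model.parameters())
    # step_size=1 with gamma=0.5 halves the learning rate after each epoch.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
    loss_fn = nn.CrossEntropyLoss()

    best_acc, epochs_without_improvement = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()

        # Evaluate on the validation set for early stopping.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for inputs, labels in dev_loader:
                correct += (model(inputs).argmax(dim=1) == labels).sum().item()
                total += labels.size(0)
        acc = correct / total
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_acc
```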