Question Answering as Global Reasoning Over Semantic Abstractions

Authors: Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Dan Roth

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a new QA system, SEMANTICILP, based on these ideas, and evaluate it on multiple-choice questions from two domains involving rich linguistic structure and reasoning: elementary and middle-school level science exams, and early-college level biology reading comprehension. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%.
Researcher Affiliation | Collaboration | Daniel Khashabi, University of Pennsylvania (danielkh@cis.upenn.edu); Tushar Khot and Ashish Sabharwal, Allen Institute for Artificial Intelligence (AI2) ({tushark,ashishs}@allenai.org); Dan Roth, University of Pennsylvania (danroth@cis.upenn.edu)
Pseudocode | No | The paper describes its Integer Linear Program (ILP) formulation and its various components but does not include explicit pseudocode or algorithm blocks (an illustrative sketch of such an ILP appears after this table).
Open Source Code | Yes | Code available at: https://github.com/allenai/semanticilp
Open Datasets | Yes | For the first domain, we have a collection of question sets containing elementary-level science questions from standardized tests (Clark et al. 2016; Khot, Sabharwal, and Clark 2017). Specifically, REGENTS 4TH contains all non-diagram multiple-choice questions from 6 years of NY Regents 4th grade science exams (127 train questions, 129 test). ... For the second domain, we use the PROCESSBANK dataset for the reading comprehension task proposed by Berant et al. (2014). ... The resulting dataset has 293 train and 109 test questions, based on 147 biology paragraphs.
Dataset Splits | No | The paper provides train and test set sizes (e.g., "127 train questions, 129 test") but does not explicitly mention or quantify validation dataset splits.
Hardware Specification | No | The paper mentions "ILP complexity" and timing statistics for "model creation" and "solving the ILP" (Table 6), but it does not specify any hardware details, such as CPU or GPU models or memory, used for these operations.
Software Dependencies | No | The paper mentions the "open source SCIP engine (Achterberg 2009)" and an "ALLENNLP re-implementation of BIDAF" but does not provide specific version numbers for these software components.
Experiment Setup | No | The paper states that weights for the ensemble of solvers are "trained using the union of training data from all question sets" and mentions different annotator combinations, but it does not provide specific hyperparameter values such as learning rates, batch sizes, or optimizer settings for any of the models or the ensemble training process.
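The paper's core component is an ILP that selects an answer option by finding a high-scoring alignment (a "support graph") between semantic abstractions of the question, the answer options, and the knowledge text. Since the paper provides no pseudocode, the following is a minimal, illustrative sketch of how such an alignment ILP could be set up. It is not the authors' actual formulation (which operates over richer semantic graphs from SRL, coreference, and other annotators, with many more constraints); the similarity function `sim`, the term-level variables, and the constraints below are assumptions made for illustration, and PuLP is used only as a convenient off-the-shelf solver.

```python
# Illustrative sketch only: a toy support-graph ILP for multiple-choice QA.
# All variable/constraint choices here are assumptions, not the paper's ILP.
import pulp


def answer_question(question_terms, answer_options, knowledge_terms, sim):
    """Pick the answer option best supported by alignments through knowledge.

    sim(a, b) -> float is a hypothetical similarity score between two terms.
    Inputs are assumed non-empty.
    """
    prob = pulp.LpProblem("support_graph", pulp.LpMaximize)

    # z[a] = 1 iff answer option a is selected.
    z = {a: pulp.LpVariable(f"z_{i}", cat="Binary")
         for i, a in enumerate(answer_options)}
    # x[q, k] = 1 iff question term q aligns to knowledge term k.
    x = {(q, k): pulp.LpVariable(f"x_{i}_{j}", cat="Binary")
         for i, q in enumerate(question_terms)
         for j, k in enumerate(knowledge_terms)}
    # y[k, a] = 1 iff knowledge term k aligns to answer option a.
    y = {(k, a): pulp.LpVariable(f"y_{j}_{i}", cat="Binary")
         for j, k in enumerate(knowledge_terms)
         for i, a in enumerate(answer_options)}

    # Objective: total similarity of the active alignment edges.
    prob += (pulp.lpSum(sim(q, k) * x[q, k] for (q, k) in x)
             + pulp.lpSum(sim(k, a) * y[k, a] for (k, a) in y))

    # Exactly one answer option is selected.
    prob += pulp.lpSum(z.values()) == 1
    # Knowledge-to-answer edges may only touch the selected option.
    for (k, a) in y:
        prob += y[k, a] <= z[a]
    # Require at least one edge on each side, so the chosen option is
    # actually connected to the question through the knowledge text.
    prob += pulp.lpSum(x.values()) >= 1
    prob += pulp.lpSum(y.values()) >= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return next(a for a in answer_options if pulp.value(z[a]) > 0.5)


# Toy usage with an exact-match similarity, purely illustrative.
toy_sim = lambda a, b: 1.0 if a.lower() == b.lower() else 0.0
print(answer_question(
    ["which", "organ", "pumps", "blood"],
    ["heart", "lung"],
    ["the", "heart", "pumps", "blood", "through", "the", "body"],
    toy_sim))  # -> "heart"
```

In the paper's system the alignment edges connect nodes of semantic graphs produced by multiple annotators, and additional constraints enforce well-formed support graphs; this sketch only conveys the general shape of such an optimization.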