Question Answering as Global Reasoning Over Semantic Abstractions
Authors: Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Dan Roth
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a new QA system, SEMANTICILP, based on these ideas, and evaluate it on multiple-choice questions from two domains involving rich linguistic structure and reasoning: elementary and middle-school level science exams, and early-college level biology reading comprehension. Our system, SEMANTICILP, demonstrates strong performance on two domains simultaneously. In particular, on a collection of challenging science QA datasets, it outperforms various state-of-the-art approaches, including neural models, broad coverage information retrieval, and specialized techniques using structured knowledge bases, by 2%-6%. |
| Researcher Affiliation | Collaboration | Daniel Khashabi, University of Pennsylvania (danielkh@cis.upenn.edu); Tushar Khot and Ashish Sabharwal, Allen Institute for Artificial Intelligence (AI2) ({tushark, ashishs}@allenai.org); Dan Roth, University of Pennsylvania (danroth@cis.upenn.edu) |
| Pseudocode | No | The paper describes its Integer Linear Program (ILP) formulation and various components but does not include explicit pseudocode blocks or algorithms. (A hedged sketch of a generic alignment-style ILP appears after this table.) |
| Open Source Code | Yes | Code available at: https://github.com/allenai/semanticilp |
| Open Datasets | Yes | For the first domain, we have a collection of question sets containing elementary-level science questions from standardized tests (Clark et al. 2016; Khot, Sabharwal, and Clark 2017). Specifically, REGENTS 4TH contains all nondiagram multiple choice questions from 6 years of NY Regents 4th grade science exams (127 train questions, 129 test). ... For the second domain, we use the PROCESSBANK dataset for the reading comprehension task proposed by Berant et al. (2014). ... The resulting dataset has 293 train and 109 test questions, based on 147 biology paragraphs. |
| Dataset Splits | No | The paper provides train and test set sizes (e.g., "127 train questions, 129 test") but does not mention or quantify a separate validation split. |
| Hardware Specification | No | The paper mentions "ILP complexity" and "Timing stats" including "model creation" and "solving the ILP" times (Table 6), but it does not specify any hardware details like CPU, GPU models, or memory used for these operations. |
| Software Dependencies | No | The paper mentions "open source SCIP engine (Achterberg 2009)" and "ALLENNLP re-implementation of BIDAF" but does not provide specific version numbers for these software components. |
| Experiment Setup | No | The paper states that weights for the ensemble of solvers are "trained using the union of training data from all questions sets" and mentions different annotator combinations, but it does not provide specific hyperparameter values such as learning rates, batch sizes, or optimizer settings for any of the models or the ensemble training process. (A hedged sketch of one possible ensemble combiner appears after this table.) |
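Since the Pseudocode row notes that the paper gives no explicit algorithms, the sketch below illustrates the general shape of an alignment-style ILP for multiple-choice QA, written against the PuLP modeling library. Every name, score, and constraint here is an illustrative assumption: SEMANTICILP's actual formulation aligns semantic abstractions into a support graph and is solved with the SCIP engine, and its real constraint set is far richer than this.

```python
# A minimal, hypothetical alignment ILP for multiple-choice QA.
# NOT the paper's formulation; all scores and constraints are placeholders.
import pulp

# Hypothetical inputs: question terms, knowledge-sentence terms, answer
# options, and precomputed similarity scores between them.
q_terms = ["friction", "produces", "heat"]
s_terms = ["rubbing", "creates", "warmth"]
options = ["heat", "light"]
sim_qs = {(q, s): 0.8 for q in q_terms for s in s_terms}  # placeholder scores
sim_so = {(s, o): (0.9 if o == "heat" else 0.1)
          for s in s_terms for o in options}

prob = pulp.LpProblem("semantic_alignment", pulp.LpMaximize)

# Binary decision variables: alignment edges and answer-option selection.
xqs = pulp.LpVariable.dicts("qs", sim_qs, cat="Binary")
xso = pulp.LpVariable.dicts("so", sim_so, cat="Binary")
ans = pulp.LpVariable.dicts("ans", options, cat="Binary")

# Objective: total weight of the active alignment edges (the support graph).
prob += pulp.lpSum(sim_qs[e] * xqs[e] for e in sim_qs) + \
        pulp.lpSum(sim_so[e] * xso[e] for e in sim_so)

# Exactly one answer option may be selected.
prob += pulp.lpSum(ans[o] for o in options) == 1

# An edge into an option can be active only if that option is selected.
for (s, o) in sim_so:
    prob += xso[(s, o)] <= ans[o]

# Crude sparsity cap standing in for the paper's structural constraints.
prob += pulp.lpSum(xqs.values()) + pulp.lpSum(xso.values()) <= 4

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([o for o in options if ans[o].value() == 1])  # -> ['heat']
```

The point of the sketch is the division of labor the Pseudocode row alludes to: similarity scores are computed outside the program, and the ILP only decides which alignment edges and which answer option jointly maximize the supported score.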
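The Experiment Setup row states only that ensemble weights are trained on the union of all training data. As one possible reading, the sketch below trains a logistic-regression combiner over per-solver confidence scores with scikit-learn; the model choice, feature layout, and numbers are assumptions, not the paper's documented procedure.

```python
# A hypothetical ensemble combiner over per-solver confidence scores.
# The paper does not specify its model or hyperparameters; this is one guess.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed features: one confidence score per solver variant
# (e.g., per annotator combination) for each candidate answer.
X_train = np.array([
    [0.9, 0.2, 0.4],  # confidences for a correct candidate
    [0.1, 0.3, 0.2],  # confidences for an incorrect candidate
    [0.7, 0.8, 0.6],
    [0.2, 0.1, 0.3],
])
y_train = np.array([1, 0, 1, 0])  # 1 = candidate matches the gold answer

combiner = LogisticRegression().fit(X_train, y_train)

# At test time, score each option of a question and pick the best one.
X_question = np.array([[0.6, 0.7, 0.5],
                       [0.3, 0.2, 0.4]])
scores = combiner.predict_proba(X_question)[:, 1]
print(int(np.argmax(scores)))  # index of the predicted option
```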