Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions

Authors: Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Turney, Daniel Khashabi

AAAI 2016 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam (using only non-diagram, multiple choice questions), and show that our overall system's score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work. ... We carry out ablation studies that quantify the contribution of each method to Aristo, and show that all levels of representation help. Our error analysis indicates the complementary strengths and weaknesses of each method, and directions for future work.
Researcher Affiliation	Collaboration	Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Turney Allen Institute for Artificial Intelligence EMAIL Daniel Khashabi Cognitive Computation Lab (CCG), Univ Illinois, Urbana-Champaign EMAIL
Pseudocode	No	The paper describes methods and algorithms in prose but does not include structured pseudocode or algorithm blocks.
Open Source Code	No	The paper states 'Our datasets are being released to enable further research.' and 'We are releasing our datasets (at www.allenai.org) to encourage such research.' but does not explicitly state that the code for the described methodology is being released as open source.
Open Datasets	Yes	Our datasets are being released to enable further research. ... We are releasing our datasets (at www.allenai.org) to encourage such research.
Dataset Splits	No	The paper states '6 years of exams (108 NDMC questions) for training and 6 years (129 NDMC questions) for testing,' but does not explicitly mention a separate validation split with details.
Hardware Specification	No	The paper does not provide specific hardware details used for running the experiments.
Software Dependencies	No	The paper mentions software like 'Lucene' and 'SCIP' and cites 'SCIP (Achterberg 2009)', but does not provide explicit version numbers for these or other software dependencies within the text.
Experiment Setup	No	The paper describes the model architecture and training of the combiner, but does not provide specific hyperparameters or detailed training configurations for the experimental setup.