Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Query, Reason, and Answer Questions On Ambiguous Texts

Authors: Xiaoxiao Guo, Tim Klinger, Clemens Rosenbaum, Joseph P. Bigus, Murray Campbell, Ban Kawas, Kartik Talamadupula, Gerry Tesauro, Satinder Singh

ICLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our architectures on four QRAQ dataset types, and scale the complexity for each along multiple dimensions. We evaluate our methods on four types of datasets described below. Each dataset contains 107,000 QRAQ problems, with 100,000 for training, 2000 for testing, and 5000 for validation.
Researcher Affiliation	Collaboration	Xiaoxiao Guo Computer Science & Engineering University of Michigan EMAIL Tim Klinger IBM Watson Research Yorktown Heights, NY EMAIL Clemens Rosenbaum Computer Science UMass Amherst EMAIL Joseph P. Bigus, Murray Campbell, Ban Kawas, Kartik Talamadupula, Gerald Tesauro IBM Watson Research Yorktown Heights, NY jbigus,mcam,bkawas,krtalamad,EMAIL Satinder Singh Computer Science & Engineering University of Michigan EMAIL
Pseudocode	No	The paper describes the control flow and architectures but does not include structured pseudocode or algorithm blocks.
Open Source Code	No	We will include a detailed description of the simulator and this algorithm when we release the QRAQ datasets to the research community.
Open Datasets	No	Each dataset contains 107,000 QRAQ problems, with 100,000 for training, 2000 for testing, and 5000 for validation. We will include a detailed description of the simulator and this algorithm when we release the QRAQ datasets to the research community.
Dataset Splits	Yes	Each dataset contains 107,000 QRAQ problems, with 100,000 for training, 2000 for testing, and 5000 for validation.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies	No	The paper mentions using 'Adam (Kingma & Ba (2015))' but does not provide version numbers for Adam or any other software dependencies.
Experiment Setup	Yes	The number of memory hops is fixed to 4. The embedding dimensionality is fixed to 50. ... Specifically, the rewards is +1 for correct final answers, -5 for wrong final answers. We explored five pairs of query reward values for the curriculum: +/-0.01, +/-0.05, +/-0.1, +/-0.5, +/-1, and found that +/-0.05 performed best on a validation set, so that is what we use for our experiments. ... For our experiments, ϵ = 0.1.
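
The setup values quoted in the table can be gathered into one configuration object, which may help when trying to reproduce the experiments. The following is a minimal sketch only: the class name, field names, and the constant QUERY_REWARD_CANDIDATES are hypothetical and do not come from the paper, and the excerpt does not say what ϵ controls, so it is stored without interpretation.

```python
from dataclasses import dataclass

# Candidate query-reward magnitudes (+/-) quoted in the Experiment Setup row.
QUERY_REWARD_CANDIDATES = (0.01, 0.05, 0.1, 0.5, 1.0)


@dataclass(frozen=True)
class QRAQSetup:
    """Reported values from the table; all field names are illustrative."""

    memory_hops: int = 4                  # "number of memory hops is fixed to 4"
    embedding_dim: int = 50               # "embedding dimensionality is fixed to 50"
    reward_correct_answer: float = 1.0    # +1 for correct final answers
    reward_wrong_answer: float = -5.0     # -5 for wrong final answers
    query_reward_magnitude: float = 0.05  # +/-0.05 performed best on validation
    epsilon: float = 0.1                  # role not stated in the quoted excerpt
    n_train: int = 100_000
    n_test: int = 2_000
    n_valid: int = 5_000

    @property
    def n_total(self) -> int:
        # Sum of the three splits reported per dataset.
        return self.n_train + self.n_test + self.n_valid


setup = QRAQSetup()
```

Summing the three splits through setup.n_total gives a quick consistency check against the 107,000 problems reported per dataset.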