Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

StepGame: A New Benchmark for Robust Multi-Hop Spatial Reasoning in Texts

Authors: Zhengxiang Shi, Qiang Zhang, Aldo Lipani

Pages: 11321-11329

AAAI 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate that state-of-the-art models on the bAbI dataset struggle on the StepGame dataset. Moreover, we propose a Tensor-Product based Memory-Augmented Neural Network (TP-MANN) specialized for spatial reasoning tasks. Experimental results on both datasets show that our model outperforms all the baselines with superior generalization and robustness performance.
Researcher Affiliation	Academia	Zhengxiang Shi (1), Qiang Zhang (2), Aldo Lipani (1); (1) University College London, (2) Zhejiang University
Pseudocode	No	The paper describes the model architecture and data generation process using text, diagrams (Figure 3), and mathematical formulas, but does not include explicit pseudocode or algorithm blocks.
Open Source Code	Yes	The software and data are available at: https://github.com/ZhengxiangShi/StepGame
Open Datasets	Yes	In this paper, we present a new Question-Answering dataset called StepGame for robust multi-hop spatial reasoning in texts. ... The software and data are available at: https://github.com/ZhengxiangShi/StepGame
Dataset Splits	Yes	For the bAbI dataset we only focus on task 17 and task 19 and use the original train and test splits made of 10 000 samples for the training set and 1 000 for the validation and test sets. For the StepGame dataset, we generate a training set made of samples varying k from 1 to 5 at steps of 1, and a test set with k varying from 1 to 10. Moreover, the test set will also contain distracting noise. The final dataset consists of, for each k value, 10 000 training samples, 1 000 validation samples, and 10 000 test samples.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies	No	The paper mentions 'The software and data are available at: https://github.com/ZhengxiangShi/StepGame' but does not list specific software dependencies with version numbers in the text.
Experiment Setup	No	All training details, including those for our model, are reported in the Appendix.
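
The split sizes quoted in the Dataset Splits row can be tallied with a short sketch. This is an illustrative calculation, not code from the paper or its repository. It reads "for each k value" as applying to every k in the stated training range (1 to 5) and test range (1 to 10). The quoted text does not say which k values the validation set covers, so the `valid_ks` default below is an assumption, and all names are hypothetical.

```python
# Per-k sample counts quoted in the "Dataset Splits" row.
TRAIN_PER_K = 10_000
VALID_PER_K = 1_000
TEST_PER_K = 10_000

# k ranges quoted in the row: training k = 1..5, test k = 1..10.
TRAIN_KS = range(1, 6)
TEST_KS = range(1, 11)


def split_totals(valid_ks=TRAIN_KS):
    """Return the total number of samples in each split.

    valid_ks is an ASSUMPTION: the quoted text does not state which k
    values the validation set covers, so it defaults to the training range.
    """
    return {
        "train": TRAIN_PER_K * len(TRAIN_KS),
        "valid": VALID_PER_K * len(valid_ks),
        "test": TEST_PER_K * len(TEST_KS),
    }


if __name__ == "__main__":
    for name, count in split_totals().items():
        print(f"{name}: {count}")
```

Passing a different range, for example `split_totals(valid_ks=range(1, 11))`, shows how the validation total changes if validation samples were instead generated for every test k.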