Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
GRAML: Goal Recognition As Metric Learning
Authors: Matan Shamir, Reuth Mirsky
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on a versatile set of environments, GRAML shows speed, flexibility, and runtime improvements over the state-of-the-art GR while maintaining accurate recognition. |
| Researcher Affiliation | Academia | Matan Shamir¹, Reuth Mirsky¹,² — ¹Computer Science Department, Bar-Ilan University, Israel; ²Computer Science Department, Tufts University, MA, USA. EMAIL, EMAIL |
| Pseudocode | No | The paper describes steps in regular paragraph text without structured formatting, and no figures are labeled as pseudocode or algorithm. |
| Open Source Code | Yes | https://github.com/MatanShamir1/Grlib |
| Open Datasets | Yes | Building on the GCRL survey and the benchmark environments suggested at Apex RL, we form a collection of GR problems from several sets of environments that adhere to the Gymnasium API, with detailed descriptions of each in Appendix ??. We consider two custom Minigrid environments from the minigrid package [Chevalier-Boisvert et al., 2023], two custom Point Maze environments from the Gymnasium-Robotics package [Fu et al., 2020], the Parking environment from the highway-env package [Leurent, 2018], and the Reach environment from Panda Gym [Gallouédec et al., 2021]. |
| Dataset Splits | No | The paper mentions varying observation sequence lengths (30%, 50%, 70%, 100%) and generating '200 GR problems per scenario' but does not specify training, validation, or test splits in a way that would allow the data partitioning to be reproduced. |
| Hardware Specification | Yes | All experiments were conducted on a commodity Intel i-7 pro. |
| Software Dependencies | No | The paper mentions software like Python, PyTorch, Stable Baselines3, Gymnasium API, minigrid package, Gymnasium-Robotics package, highway-env package, and Panda Gym, but does not provide specific version numbers for these components. |
| Experiment Setup | Yes | Each single-goal agent was trained for 300,000 timesteps, and the goal-conditioned agent was trained for 1 million timesteps. ... G was set to 20, while BG-GRAML used only 5. ... For each environment, we tested observation sequences that are 30%, 50%, 70%, and 100% of the full sequence, both consecutively and non-consecutively. |