Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification

Authors: Flavio Petruzzellis, Cristina Cornelio, Pietro Lio

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation against baseline methods demonstrates the consistent significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs' reasoning and compositional capabilities.
Researcher Affiliation Collaboration 1Department of Mathematics, University of Padova, Padova, Italy 2Samsung AI, Cambridge, UK 3Computer Science Department, University of Cambridge, Cambridge, UK. Correspondence to: Cristina Cornelio <EMAIL>.
Pseudocode No The paper describes methods and processes through textual descriptions and diagrams (Figure 1, Figure 2, Figure 3) rather than formal pseudocode or algorithm blocks.
Open Source Code Yes Code available at https://github.com/corneliocristina/HVR.
Open Datasets Yes We employed OntoThor (Cornelio & Diab, 2024), which describes the AI2Thor kitchen environment, as our ontology.
Dataset Splits No The paper categorizes tasks by complexity (moderate and high) and states that tasks for T12 and T5bis have variable ground truth plans, but it does not specify explicit training, validation, or test dataset splits in terms of percentages, counts, or predefined partition files for any models or experiments.
Hardware Specification No The paper mentions using specific LLMs like Phi-3-mini-4k-instruct and gemini-1.5-flash (and gemini-2.0-flash in the appendix) but does not provide details about the underlying hardware (e.g., GPU models, CPU types, or memory specifications) on which these models were run or evaluated.
Software Dependencies No The paper mentions using AI2Thor, OntoThor, Phi-3-mini-4k-instruct, gemini-1.5-flash, PDDL, and a Python-based validator. However, it does not specify version numbers for Python or any of the libraries/frameworks used for implementation, which is necessary for reproducible software dependencies.
Experiment Setup Yes To generate macro actions, we employ a two-shot prompt template that incorporates the goal-oriented task description and the list of relevant objects with their properties, retrieved by the KG-RAG approach. A frozen LLM is fed with this prompt and outputs an ordered list of macro actions in natural language. ... To ensure consistent results, we employ in-context learning, providing two examples of plans for simple macro actions. ... When correcting an AA-block, for each step, the system attempts correction up to 2x times, where x, the length of the action block, is dynamically updated based on the current number of steps. For example, starting with 5 steps (x = 5), if a correction added a missing step, x increases to 6. Similarly, x adjusts whenever steps are removed. To prevent an infinite loop of corrections, each block has a static upper limit of 50 steps.
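The correction-budget scheme quoted above (up to 2x attempts per block, with x tracking the block's current length, and a hard cap of 50 steps) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and callback names (`correct_block`, `try_fix`) are hypothetical.

```python
# Sketch of the dynamic correction budget from the Experiment Setup excerpt:
# a block of x steps gets up to 2*x correction attempts, and x is re-read
# after every attempt because corrections may add or remove steps.

MAX_BLOCK_STEPS = 50  # static upper limit per block, as stated in the paper


def correct_block(steps, try_fix):
    """Repeatedly apply `try_fix` until the block is valid or the budget runs out.

    `steps` is the current list of actions in the AA-block; `try_fix` is a
    hypothetical callable returning (new_steps, is_valid).
    """
    attempts = 0
    valid = False
    # The budget 2 * len(steps) is re-evaluated each iteration, so adding a
    # missing step (x: 5 -> 6) also raises the attempt limit (10 -> 12).
    while not valid and attempts < 2 * len(steps):
        if len(steps) > MAX_BLOCK_STEPS:
            break  # guard against unbounded block growth
        steps, valid = try_fix(steps)
        attempts += 1
    return steps, valid
```

For example, a corrector that inserts one missing step per call would grow a 5-step block to 6 steps and then validate, well within the 12-attempt budget.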