Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification

Authors: Flavio Petruzzellis, Cristina Cornelio, Pietro Lio

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our evaluation against baseline methods demonstrates the consistent significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs' reasoning and compositional capabilities.
Researcher Affiliation Collaboration 1Department of Mathematics, University of Padova, Padova, Italy 2Samsung AI, Cambridge, UK 3Computer Science Department, University of Cambridge, Cambridge, UK. Correspondence to: Cristina Cornelio <EMAIL>.
Pseudocode No The paper describes methods and processes through textual descriptions and diagrams (Figure 1, Figure 2, Figure 3) rather than formal pseudocode or algorithm blocks.
Open Source Code Yes Code available at https://github.com/corneliocristina/HVR.
Open Datasets Yes We employed OntoThor (Cornelio & Diab, 2024), which describes the AI2Thor kitchen environment, as our ontology.
Dataset Splits No The paper categorizes tasks by complexity (moderate and high) and states that tasks for T12 and T5bis have variable ground truth plans, but it does not specify explicit training, validation, or test dataset splits in terms of percentages, counts, or predefined partition files for any models or experiments.
Hardware Specification No The paper mentions using specific LLMs like Phi-3-mini-4k-instruct and gemini-1.5-flash (and gemini-2.0-flash in the appendix) but does not provide details about the underlying hardware (e.g., GPU models, CPU types, or memory specifications) on which these models were run or evaluated.
Software Dependencies No The paper mentions using AI2Thor, OntoThor, Phi-3-mini-4k-instruct, gemini-1.5-flash, PDDL, and a Python-based validator. However, it does not specify version numbers for Python or any of the libraries/frameworks used for implementation, which is necessary for reproducible software dependencies.
Experiment Setup Yes To generate macro actions, we employ a two-shot prompt template that incorporates the goal-oriented task description and the list of relevant objects with their properties, retrieved by the KG-RAG approach. A frozen LLM is fed with this prompt and outputs an ordered list of macro actions in natural language. ... To ensure consistent results, we employ in-context learning, providing two examples of plans for simple macro actions. ... When correcting an AA-block, for each step, the system attempts correction up to 2x times, where x, the length of the action block, is dynamically updated based on the current number of steps. For example, starting with 5 steps (x = 5), if a correction added a missing step, x increases to 6. Similarly, x adjusts whenever steps are removed. To prevent an infinite loop of corrections, each block has a static upper limit of 50 steps.
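The correction-budget scheme quoted above (up to 2x attempts per block, with x tracking the block's current length, and a hard cap of 50 steps) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function and callback names (`correct_block`, `try_fix`) are hypothetical.

```python
# Sketch of the dynamic correction budget from the Experiment Setup excerpt:
# a block of x steps gets up to 2*x correction attempts, and x is re-read
# after every attempt because corrections may add or remove steps.

MAX_BLOCK_STEPS = 50  # static upper limit per block, as stated in the paper


def correct_block(steps, try_fix):
    """Repeatedly apply `try_fix` until the block is valid or the budget runs out.

    `steps` is the current list of actions in the AA-block; `try_fix` is a
    hypothetical callable returning (new_steps, is_valid).
    """
    attempts = 0
    valid = False
    # The budget 2 * len(steps) is re-evaluated each iteration, so adding a
    # missing step (x: 5 -> 6) also raises the attempt limit (10 -> 12).
    while not valid and attempts < 2 * len(steps):
        if len(steps) > MAX_BLOCK_STEPS:
            break  # guard against unbounded block growth
        steps, valid = try_fix(steps)
        attempts += 1
    return steps, valid
```

For example, a corrector that inserts one missing step per call would grow a 5-step block to 6 steps and then validate, well within the 12-attempt budget.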