Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Enhancing LLM Planning for Robotics Manipulation through Hierarchical Procedural Knowledge Graphs

Authors: Jiacong Zhou, Jiaxu Miao, Xianyun Wang, Jun Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that small scale LLMs (7B) enhanced by our HP-KG significantly improve the planning capabilities, which are stronger than 72B LLMs only. Encouragingly, our approach remains effective on the most powerful GPT-4o model.
Researcher Affiliation	Academia	(1) School of Computer Science, Hangzhou Dianzi University; (2) The School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen; (3) Pengcheng Laboratory
Pseudocode	Yes	B. Procedural Graph Retrieval-Augmented Planning Algorithm. The complete formulation of the retrieval algorithm is detailed in Algorithm 1.
Open Source Code	Yes	Our code and data will be publicly available (footnote 3: https://anonymous.4open.science/r/HP-KG-68EE/).
Open Datasets	Yes	In our work, we utilize two main data sources. The first is the WikiHow corpus [28]... The second is the BEHAVIOR-1K dataset [27]... We mainly evaluate our HP-KG on RLBench [29] and Blocks Arrange [70] tasks for robotics manipulation. Furthermore, we also evaluate our approach on ActPlan-1K [18] for LLM planning.
Dataset Splits	No	The paper evaluates on established benchmarks (RLBench, Blocks Arrange, ActPlan-1K) but does not specify how these datasets are split into training, validation, and test sets for model training or tuning in its experimental setup. It mentions using '20 trials per task' for RLBench and evaluating 'on Actplan-1K', which indicates evaluation on predefined benchmark test sets.
Hardware Specification	Yes	We conduct our experiments on servers equipped with NVIDIA A6000 GPUs (48GB VRAM), with NVIDIA CUDA Toolkit version 11.8.
Software Dependencies	No	We conduct our experiments on servers equipped with NVIDIA A6000 GPUs (48GB VRAM), with NVIDIA CUDA Toolkit version 11.8. For inference time comparison, we deploy different-sized models (7B and 72B) using the vLLM inference framework and use the AWQ quantized version for the 72B model to fit within the available GPU memory.
Experiment Setup	Yes	In our Iterative Verification and Refinement process, the maximum iterations is set to 3. For all clustering operations, we employ cosine similarity with a threshold of 0.85. In our Procedural Graph Retrieval process, we set k1 to 100 and set k2 to 3 or 5 depending on the experimental conditions. For zero-shot manipulation approaches, we utilize GPT-4o as the primary planner unless explicitly stated in experiments and we set the target retrieval level L_target to the step procedure. For experiments on Actplan-1K, we set the target retrieval level L_target to the Task procedure due to its more complex task objectives.
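
The Experiment Setup row reports that clustering uses cosine similarity with a threshold of 0.85 but does not say which clustering procedure is applied. The following is a minimal sketch of one way such a threshold could be used, assuming a simple greedy scheme in which each vector joins the first cluster whose first member is similar enough. The function names, the greedy assignment rule and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math

# Threshold reported in the Experiment Setup row.
SIM_THRESHOLD = 0.85


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_cluster(vectors, threshold=SIM_THRESHOLD):
    """Greedy grouping (assumed scheme, not taken from the paper).

    Each vector is compared with the first member of every existing cluster
    and joins the first cluster whose similarity reaches the threshold;
    otherwise it starts a new cluster. Returns lists of input indices.
    """
    clusters = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            representative = vectors[cluster[0]]
            if cosine_similarity(representative, vector) >= threshold:
                cluster.append(index)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([index])
    return clusters


# Toy 2-D embeddings standing in for step or task embeddings (hypothetical data).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
toy_clusters = threshold_cluster(toy_vectors)
print(toy_clusters)
```

With these toy vectors the first two and the last two inputs end up in separate groups, because only those pairs exceed the 0.85 threshold. A greedy scheme like this depends on input order; an order-independent alternative would be agglomerative clustering with a distance cutoff of 1 - 0.85.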