Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Neuro-Symbolic Procedural Planning with Commonsense Prompting
Authors: Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Both automatic and human evaluations on WikiHow and RobotHow show the superiority of PLAN on procedural planning without further training or manual exemplars. |
| Researcher Affiliation | Academia | ¹University of California, Santa Barbara, CA, USA; ²University of California, Santa Cruz, CA, USA |
| Pseudocode | Yes | Algorithm 1 Neuro-Symbolic Procedural Planning using Commonsense-Infused Prompting |
| Open Source Code | Yes | Source code and datasets are publicly available at https://sites.google.com/view/iclr-clap ... We provide our code implementation at https://anonymous.4open.science/r/PLANNER-7B24 to reproduce our experiments. |
| Open Datasets | Yes | Datasets: We conduct zero-shot experiments on two datasets with procedural information, WikiHow (collected following (Koupaee & Wang, 2018)) and RobotHow (Puig et al., 2018) without training. ... Source code and datasets are publicly available at https://sites.google.com/view/iclr-clap |
| Dataset Splits | No | We conduct zero-shot experiments on two datasets with procedural information...without training. ... We perform a hyperparameter search for all evaluated methods for the following hyperparameters. ... The configurations used in the experiments are: θ=0.7, 20 step horizon, 3 hops, 3 ratio of concepts to task length, cosine similarity threshold 0.4, θe=0.6 and k=10. The paper performs hyperparameter search but does not specify a separate validation dataset split used for this purpose, only the chosen configurations. |
| Hardware Specification | Yes | We use one single NVIDIA A100 GPU Server for all the experiments. |
| Software Dependencies | No | The paper mentions software like 'BART-large version', '1.5 billion parameter GPT-2 (aka gpt2-xl)', 'GPT3 (davinci)', 'sentence-transformers (RoBERTa-large)', and 'Hugging Face'. However, it does not provide specific version numbers for the general software environment (e.g., Python, PyTorch, CUDA) or the mentioned libraries/models. |
| Experiment Setup | Yes | The configurations used in the experiments are: θ=0.7, 20 step horizon, 3 hops, 3 ratio of concepts to task length, cosine similarity threshold 0.4, θe=0.6 and k=10. We perform a hyperparameter search for all evaluated methods for the following hyperparameters. The confidence threshold θ, which terminate the generation when below it, is searched in {0, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8}. The steps horizon, which constrains the maximal number of procedural planning steps, is searched in {10, 20, 40}. The number of hops for retrieving the subgraph from the external knowledge base is searched in {1, 2, 3}. The ratio of maximal concepts to the length of the task name is searched in {1, 2, 3}. The cosine similarity threshold for keeping the task-specific concept is searched in {0.4, 0.6, 0.8}. The edge weight threshold θe is searched in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. The top-k task-specific nodes value is searched in {1, 5, 10, 15, 20, 25, 50, 100}. |
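
The search space and chosen configuration quoted in the Experiment Setup row can be written down as data, which makes it easy to check that the reported values fall inside the searched ranges. The following is a minimal sketch, not the authors' code: the key names (`confidence_threshold`, `step_horizon`, and so on) and the helper functions are hypothetical, and enumerating the full Cartesian product is an assumption, since the quoted text does not say whether the search was exhaustive.

```python
from itertools import product
from math import prod

# Searched values as quoted in the Experiment Setup row.
# Key names are illustrative assumptions, not identifiers from the paper's code.
SEARCH_SPACE = {
    "confidence_threshold": [0, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8],  # theta
    "step_horizon": [10, 20, 40],
    "num_hops": [1, 2, 3],
    "concept_ratio": [1, 2, 3],
    "cosine_threshold": [0.4, 0.6, 0.8],
    "edge_weight_threshold": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],  # theta_e
    "top_k": [1, 5, 10, 15, 20, 25, 50, 100],
}

# Configuration reported as used in the experiments.
REPORTED_CONFIG = {
    "confidence_threshold": 0.7,
    "step_horizon": 20,
    "num_hops": 3,
    "concept_ratio": 3,
    "cosine_threshold": 0.4,
    "edge_weight_threshold": 0.6,
    "top_k": 10,
}


def grid_size(space):
    """Number of configurations in the full Cartesian product of the space."""
    return prod(len(values) for values in space.values())


def iter_grid(space):
    """Yield every configuration in the Cartesian product as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def is_in_space(config, space):
    """True if the config has exactly the space's keys and every value was searched."""
    return config.keys() == space.keys() and all(
        config[key] in space[key] for key in space
    )


if __name__ == "__main__":
    print("full grid size:", grid_size(SEARCH_SPACE))
    print("reported config in space:", is_in_space(REPORTED_CONFIG, SEARCH_SPACE))
```

If the search were exhaustive, the size of the full grid would be a reason to evaluate on a held-out validation split; as the Dataset Splits row notes, the paper does not specify one.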