reproducibilityindex.ai

Neuro-Symbolic Procedural Planning with Commonsense Prompting

Authors: Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

ICLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Both automatic and human evaluations on WikiHow and RobotHow show the superiority of PLAN on procedural planning without further training or manual exemplars.
Researcher Affiliation	Academia	1 University of California, Santa Barbara, CA, USA {yujielu,weixifeng,wanrongzhu,wendaxu}@ucsb.edu {migueleckstein,wangwilliamyang}@ucsb.edu 2 University of California, Santa Cruz, CA, USA xwang366@ucsc.edu
Pseudocode	Yes	Algorithm 1 Neuro-Symbolic Procedural Planning using Commonsense-Infused Prompting
Open Source Code	Yes	Source code and datasets are publicly available at https://sites.google.com/view/iclr-clap ... We provide our code implementation at https://anonymous.4open.science/r/PLANNER-7B24 to reproduce our experiments.
Open Datasets	Yes	Datasets: We conduct zero-shot experiments on two datasets with procedural information, WikiHow (collected following (Koupaee & Wang, 2018)) and RobotHow (Puig et al., 2018) without training. ... Source code and datasets are publicly available at https://sites.google.com/view/iclr-clap
Dataset Splits	No	We conduct zero-shot experiments on two datasets with procedural information...without training. ... We perform a hyperparameter search for all evaluated methods for the following hyperparameters. ... The configurations used in the experiments are: θ=0.7, 20 step horizon, 3 hops, 3 ratio of concepts to task length, cosine similarity threshold 0.4, θe=0.6 and k=10. The paper performs hyperparameter search but does not specify a separate validation dataset split used for this purpose, only the chosen configurations.
Hardware Specification	Yes	We use one single NVIDIA A100 GPU Server for all the experiments.
Software Dependencies	No	The paper mentions software like 'BART-large version', '1.5 billion parameter GPT-2 (aka gpt2-xl)', 'GPT3 (davinci)', 'sentence-transformers (RoBERTa-large)', and 'Hugging Face'. However, it does not provide specific version numbers for the general software environment (e.g., Python, PyTorch, CUDA) or the mentioned libraries/models.
Experiment Setup	Yes	The configurations used in the experiments are: θ=0.7, 20 step horizon, 3 hops, 3 ratio of concepts to task length, cosine similarity threshold 0.4, θe=0.6 and k=10. We perform a hyperparameter search for all evaluated methods for the following hyperparameters. The confidence threshold θ, which terminates the generation when the confidence falls below it, is searched in {0, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8}. The steps horizon, which constrains the maximal number of procedural planning steps, is searched in {10, 20, 40}. The number of hops for retrieving the subgraph from the external knowledge base is searched in {1, 2, 3}. The ratio of maximal concepts to the length of the task name is searched in {1, 2, 3}. The cosine similarity threshold for keeping the task-specific concept is searched in {0.4, 0.6, 0.8}. The edge weight threshold θe is searched in {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}. The top-k task-specific nodes value is searched in {1, 5, 10, 15, 20, 25, 50, 100}.
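
The search ranges and chosen values in the Experiment Setup row can be written down as a small Python sketch, which makes it easy to check that the reported configuration lies inside the reported ranges. The dictionary keys (`confidence_threshold`, `step_horizon`, and so on) and the helper functions are hypothetical names introduced here for illustration; they are not taken from the authors' code. The row does not say whether the search was an exhaustive grid, so the Cartesian-product enumeration below is an assumption.

```python
from itertools import product
from math import prod

# Search ranges copied from the Experiment Setup row.
# Key names are hypothetical, not the authors' identifiers.
SEARCH_SPACE = {
    "confidence_threshold": [0, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8],  # theta
    "step_horizon": [10, 20, 40],
    "num_hops": [1, 2, 3],
    "concept_ratio": [1, 2, 3],
    "cosine_threshold": [0.4, 0.6, 0.8],
    "edge_weight_threshold": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],  # theta_e
    "top_k": [1, 5, 10, 15, 20, 25, 50, 100],
}

# Configuration reported as used in the experiments.
CHOSEN = {
    "confidence_threshold": 0.7,
    "step_horizon": 20,
    "num_hops": 3,
    "concept_ratio": 3,
    "cosine_threshold": 0.4,
    "edge_weight_threshold": 0.6,
    "top_k": 10,
}


def grid_size(space):
    """Number of configurations in a full Cartesian grid over the space."""
    return prod(len(values) for values in space.values())


def iter_configs(space):
    """Yield every configuration of the full grid as a dict (assumed exhaustive)."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def in_space(config, space):
    """True if every value in the config is one of the searched values."""
    return all(config[key] in space[key] for key in space)


if __name__ == "__main__":
    print("full grid size:", grid_size(SEARCH_SPACE))
    print("chosen config within search space:", in_space(CHOSEN, SEARCH_SPACE))
```

Under the full-grid assumption the space is large, so in practice a search of this kind may have varied one hyperparameter at a time; the row does not state which procedure was followed.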