reproducibilityindex.ai

Large Language Models as Commonsense Knowledge for Large-Scale Task Planning

Authors: Zirui Zhao, Wee Sun Lee, David Hsu

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments show that LLM-MCTS outperforms both MCTS alone and policies induced by LLMs (GPT2 and GPT3.5) by a wide margin for complex, novel tasks. Further experiments and analyses on multiple tasks (multiplication, travel planning, object rearrangement) suggest minimum description length (MDL) as a general guiding principle: if the description length of the world model is substantially smaller than that of the policy, using LLM as a world model for model-based planning is likely better than using LLM solely as a policy.
Researcher Affiliation	Academia	Zirui Zhao, Wee Sun Lee, David Hsu; National University of Singapore; {ziruiz, leews, dyhsu}@comp.nus.edu.sg
Pseudocode	Yes	Algorithm 1 LLM-MCTS
Open Source Code	Yes	The code and supplementary materials are available at https://llm-mcts.github.io.
Open Datasets	Yes	We evaluate LLM-MCTS in VirtualHome [24], a standard household activity simulation platform widely used in earlier work [18, 22, 25, 33].
Dataset Splits	No	To generate data for prompting and baseline training, we follow [25] to create 2000 tasks with randomly initialized scenes and expert trajectories. ... We also generated 800 tasks in total for evaluation.
Hardware Specification	Yes	CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (72 cores); GPU: NVIDIA GeForce RTX 2080 Ti
Software Dependencies	No	The paper mentions software like GPT2, GPT3.5, and Sentence-BERT, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup	Yes	We used 100 times simulation during the tree search for GPT3.5-MCTS in our experiments. For UCT, we bound the runtime by 120 seconds.
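
The Experiment Setup row quotes two different kinds of search budget: a simulation count (100) for GPT3.5-MCTS and a wall-clock bound (120 seconds) for UCT. The sketch below shows one way such budgets could be enforced around a generic tree-search loop. It is an illustration only and is not taken from the authors' code: `run_search`, `simulate_once`, and the two budget dictionaries are hypothetical names, and only the values 100 and 120 come from the table.

```python
import time
from typing import Callable, Optional


def run_search(simulate_once: Callable[[], None],
               max_simulations: Optional[int] = None,
               max_seconds: Optional[float] = None,
               clock: Callable[[], float] = time.monotonic) -> int:
    """Call simulate_once until a budget is exhausted; return the call count.

    Hypothetical helper: simulate_once stands in for one tree-search
    simulation (selection, expansion, rollout, backup).
    """
    if max_simulations is None and max_seconds is None:
        raise ValueError("at least one budget must be set")
    start = clock()
    done = 0
    while True:
        # Stop once the simulation-count budget is used up.
        if max_simulations is not None and done >= max_simulations:
            break
        # Stop once the wall-clock budget is used up (checked between calls).
        if max_seconds is not None and clock() - start >= max_seconds:
            break
        simulate_once()
        done += 1
    return done


# Budget values quoted in the Experiment Setup row; the dict names are assumed.
GPT35_MCTS_BUDGET = {"max_simulations": 100}
UCT_BUDGET = {"max_seconds": 120.0}
```

The time limit is only checked between simulations, so a single long simulation can overrun the bound. The table does not say whether the paper's 120-second limit is enforced more strictly.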