Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
Authors: Zirui Zhao, Wee Sun Lee, David Hsu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that LLM-MCTS outperforms both MCTS alone and policies induced by LLMs (GPT2 and GPT3.5) by a wide margin for complex, novel tasks. Further experiments and analyses on multiple tasks (multiplication, travel planning, object rearrangement) suggest minimum description length (MDL) as a general guiding principle: if the description length of the world model is substantially smaller than that of the policy, using LLM as a world model for model-based planning is likely better than using LLM solely as a policy. |
| Researcher Affiliation | Academia | Zirui Zhao, Wee Sun Lee, David Hsu; National University of Singapore; {ziruiz, leews, dyhsu}@comp.nus.edu.sg |
| Pseudocode | Yes | Algorithm 1 LLM-MCTS |
| Open Source Code | Yes | 1The code and supplementary materials are available at https://llm-mcts.github.io. |
| Open Datasets | Yes | We evaluate LLM-MCTS in Virtual Home [24], a standard household activity simulation platform widely used in earlier work [18, 22, 25, 33]. |
| Dataset Splits | No | To generate data for prompting and baseline training, we follow [25] to create 2000 tasks with randomly initialized scenes and expert trajectories. ... We also generated 800 tasks in total for evaluation. |
| Hardware Specification | Yes | CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (72 cores); GPU: NVIDIA GeForce RTX 2080 Ti |
| Software Dependencies | No | The paper mentions software like GPT2, GPT3.5, and Sentence-BERT, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We used 100 simulations during the tree search for GPT3.5-MCTS in our experiments. For UCT, we bound the runtime by 120 seconds. |
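The table above references Algorithm 1 (LLM-MCTS), a 100-simulation tree-search budget, and a UCT baseline. As a rough illustration only (not the authors' implementation, which is at https://llm-mcts.github.io), the sketch below shows a generic UCT loop with UCB1 selection on a hypothetical toy environment; the `step` and `reward` functions, the node structure, and all parameter values here are assumptions for demonstration:

```python
import math
import random

def ucb1(parent_visits, child_visits, child_value, c=1.0):
    """UCB1 score balancing exploitation (mean value) and exploration."""
    if child_visits == 0:
        return float("inf")  # force each action to be tried once
    return child_value / child_visits + c * math.sqrt(
        math.log(parent_visits) / child_visits
    )

class Node:
    """A search-tree node holding visit counts and accumulated return."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0

def uct_search(root_state, actions, step, reward,
               n_simulations=100, rollout_depth=5):
    """Generic UCT: select / expand / rollout / backup, n_simulations times."""
    root = Node(root_state)
    for _ in range(n_simulations):
        # Selection: descend through fully expanded nodes via UCB1.
        node = root
        while node.children and len(node.children) == len(actions):
            node = max(node.children.values(),
                       key=lambda ch: ucb1(node.visits, ch.visits, ch.value))
        # Expansion: attach one untried action.
        untried = [a for a in actions if a not in node.children]
        a = random.choice(untried)
        child = Node(step(node.state, a), parent=node)
        node.children[a] = child
        # Rollout: uniformly random actions to a fixed depth.
        state, total = child.state, reward(child.state)
        for _ in range(rollout_depth):
            state = step(state, random.choice(actions))
            total += reward(state)
        # Backup: propagate the rollout return to the root.
        while child is not None:
            child.visits += 1
            child.value += total
            child = child.parent
    # Recommend the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)
```

In LLM-MCTS the uniform pieces of this sketch are replaced by LLM-derived components: the commonsense belief over world states comes from the LLM-as-world-model, and the LLM-as-policy serves as the action prior guiding selection, rather than the uniform random choices used here.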