Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Large Language Models as Commonsense Knowledge for Large-Scale Task Planning
Authors: Zirui Zhao, Wee Sun Lee, David Hsu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that LLM-MCTS outperforms both MCTS alone and policies induced by LLMs (GPT2 and GPT3.5) by a wide margin for complex, novel tasks. Further experiments and analyses on multiple tasks (multiplication, travel planning, object rearrangement) suggest minimum description length (MDL) as a general guiding principle: if the description length of the world model is substantially smaller than that of the policy, using LLM as a world model for model-based planning is likely better than using LLM solely as a policy. |
| Researcher Affiliation | Academia | Zirui Zhao, Wee Sun Lee, David Hsu; National University of Singapore; EMAIL |
| Pseudocode | Yes | Algorithm 1 LLM-MCTS |
| Open Source Code | Yes | The code and supplementary materials are available at https://llm-mcts.github.io. |
| Open Datasets | Yes | We evaluate LLM-MCTS in VirtualHome [24], a standard household activity simulation platform widely used in earlier work [18, 22, 25, 33]. |
| Dataset Splits | No | To generate data for prompting and baseline training, we follow [25] to create 2000 tasks with randomly initialized scenes and expert trajectories. ... We also generated 800 tasks in total for evaluation. |
| Hardware Specification | Yes | CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (72 cores); GPU: NVIDIA GeForce RTX 2080 Ti |
| Software Dependencies | No | The paper mentions software like GPT2, GPT3.5, and Sentence-BERT, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We used 100 times simulation during the tree search for GPT3.5-MCTS in our experiments. For UCT, we bound the runtime by 120 seconds. |
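
The Experiment Setup row names two search budgets: a fixed count of 100 simulations for GPT3.5-MCTS and a 120-second runtime bound for UCT. The sketch below illustrates how both kinds of budget can cap a plain UCT search. It is not the paper's Algorithm 1: there is no LLM prior or LLM world model, and the toy problem (`ACTIONS`, `HORIZON`, `TARGET`), the exploration constant, and the choice to combine both stopping rules in one `search` function are all assumptions made for illustration.

```python
import math
import random
import time

# Hypothetical toy problem (not from the paper): start at 0, add 1 or 2
# at each of HORIZON steps, and earn reward 1 for ending exactly on TARGET.
ACTIONS = (1, 2)
HORIZON = 4
TARGET = 7


class Node:
    def __init__(self, state, depth):
        self.state = state
        self.depth = depth
        self.children = {}  # action -> Node
        self.visits = 0
        self.value_sum = 0.0


def reward(state):
    return 1.0 if state == TARGET else 0.0


def ucb_child(node, c=1.4):
    """Pick the child maximizing the UCB1 score (c is an assumed constant)."""
    def score(child):
        mean = child.value_sum / child.visits
        bonus = c * math.sqrt(math.log(node.visits) / child.visits)
        return mean + bonus
    return max(node.children.values(), key=score)


def simulate(node, rng):
    """Run one simulation from node and back up its return."""
    if node.depth == HORIZON:
        ret = reward(node.state)
    elif len(node.children) < len(ACTIONS):
        # Expansion: add one untried action, then finish with a random rollout.
        action = next(a for a in ACTIONS if a not in node.children)
        child = Node(node.state + action, node.depth + 1)
        node.children[action] = child
        state = child.state
        for _ in range(HORIZON - child.depth):
            state += rng.choice(ACTIONS)
        ret = reward(state)
        child.visits += 1
        child.value_sum += ret
    else:
        # Selection: every action has been tried, so descend by UCB1.
        ret = simulate(ucb_child(node), rng)
    node.visits += 1
    node.value_sum += ret
    return ret


def search(root_state, max_simulations=100, max_seconds=120.0, seed=0):
    """Search until either the simulation budget or the time budget runs out."""
    rng = random.Random(seed)
    root = Node(root_state, 0)
    deadline = time.monotonic() + max_seconds
    done = 0
    while done < max_simulations and time.monotonic() < deadline:
        simulate(root, rng)
        done += 1
    if not root.children:
        return None, done  # no simulation ran, so there is nothing to recommend
    # Recommend the most-visited root action.
    best = max(root.children, key=lambda a: root.children[a].visits)
    return best, done
```

Calling `search(0)` runs at most 100 simulations and stops earlier only if 120 seconds elapse first. Counting simulations makes runs comparable across machines, whereas a wall-clock bound depends on the hardware listed in the table.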