Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Large Language Models as Commonsense Knowledge for Large-Scale Task Planning

Authors: Zirui Zhao, Wee Sun Lee, David Hsu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that LLM-MCTS outperforms both MCTS alone and policies induced by LLMs (GPT2 and GPT3.5) by a wide margin for complex, novel tasks. Further experiments and analyses on multiple tasks (multiplication, travel planning, object rearrangement) suggest minimum description length (MDL) as a general guiding principle: if the description length of the world model is substantially smaller than that of the policy, using LLM as a world model for model-based planning is likely better than using LLM solely as a policy.
Researcher Affiliation | Academia | Zirui Zhao, Wee Sun Lee, David Hsu; National University of Singapore; EMAIL
Pseudocode | Yes | Algorithm 1 LLM-MCTS
Open Source Code | Yes | The code and supplementary materials are available at https://llm-mcts.github.io.
Open Datasets | Yes | We evaluate LLM-MCTS in VirtualHome [24], a standard household activity simulation platform widely used in earlier work [18, 22, 25, 33].
Dataset Splits | No | To generate data for prompting and baseline training, we follow [25] to create 2000 tasks with randomly initialized scenes and expert trajectories. ... We also generated 800 tasks in total for evaluation.
Hardware Specification | Yes | CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (72 cores); GPU: NVIDIA GeForce RTX 2080 Ti
Software Dependencies | No | The paper mentions software like GPT2, GPT3.5, and Sentence-BERT, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We used 100 simulations during the tree search for GPT3.5-MCTS in our experiments. For UCT, we bound the runtime by 120 seconds.
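
The Experiment Setup entry mentions two kinds of search budget: a fixed number of simulations (100 for GPT3.5-MCTS) and a wall-clock bound (120 seconds for UCT). The sketch below illustrates how such budgets can gate a search loop. It is not the paper's Algorithm 1: it is a root-level UCB1 search only, and the names `budgeted_search`, `ucb1`, and the `rollout` callback, as well as the exploration constant `c=1.4`, are hypothetical choices made for illustration.

```python
import math
import time


def ucb1(total_reward, visits, parent_visits, c=1.4):
    """UCB1 score; unvisited actions get infinite priority."""
    if visits == 0:
        return math.inf
    mean = total_reward / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)


def budgeted_search(actions, rollout, max_simulations=None, time_limit_s=None):
    """Root-level UCB1 search that stops when either budget is exhausted.

    rollout(action) -> float is a hypothetical callback that returns the
    reward of one simulation starting with `action`.
    Returns (most visited action, number of simulations run).
    """
    if max_simulations is None and time_limit_s is None:
        raise ValueError("set at least one of max_simulations or time_limit_s")

    stats = {a: [0.0, 0] for a in actions}  # action -> [total reward, visits]
    start = time.monotonic()
    n = 0
    while True:
        # Simulation-count budget (e.g. 100 simulations).
        if max_simulations is not None and n >= max_simulations:
            break
        # Wall-clock budget (e.g. 120 seconds).
        if time_limit_s is not None and time.monotonic() - start >= time_limit_s:
            break
        action = max(actions, key=lambda a: ucb1(stats[a][0], stats[a][1], n))
        reward = rollout(action)
        stats[action][0] += reward
        stats[action][1] += 1
        n += 1

    best = max(actions, key=lambda a: stats[a][1])
    return best, n


if __name__ == "__main__":
    # Toy deterministic rewards, assumed purely for demonstration.
    toy_rewards = {"a": 0.1, "b": 0.9}
    print(budgeted_search(["a", "b"], lambda a: toy_rewards[a], max_simulations=100))
```

Passing `max_simulations=100` mirrors the simulation-count setting, while `time_limit_s=120.0` mirrors the runtime bound; a full tree search would apply the same stopping checks around its select, expand, simulate, and backpropagate steps.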