On the Planning Abilities of Large Language Models - A Critical Investigation

Authors: Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, Subbarao Kambhampati

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of 12% across the domains."
Researcher Affiliation | Academia | Karthik Valmeekam, School of Computing & AI, Arizona State University, Tempe (kvalmeek@asu.edu); Matthew Marquez, School of Computing & AI, Arizona State University, Tempe (mmarqu22@asu.edu); Sarath Sreedharan, Department of Computer Science, Colorado State University, Fort Collins (sarath.sreedharan@colostate.edu); Subbarao Kambhampati, School of Computing & AI, Arizona State University, Tempe (rao@asu.edu)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | Link to the GitHub repo: https://github.com/karthikv792/LLMs-Planning
Open Datasets | Yes | "To investigate these questions in a systematic rather than anecdotal manner, we generate a suite of planning problem instances based on the kinds of domains employed in the International Planning Competition [14]. ... We prepared a dataset comprising the initial state, goal state, and the respective plan for 1,000 distinct Blocksworld instances. It's important to note that these instances were separate from our test set of 600 instances." (A hypothetical example of such an instance record appears after this table.)
Dataset Splits | Yes | "By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process."
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions software tools such as GPT-4, GPT-3.5, LPG, and VAL, but does not provide specific version numbers for these or any other ancillary software components used in the experiments.
Experiment Setup | Yes | "We set the temperature for all models to be 0, thereby making them deterministic. By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process." (A hedged setup sketch appears after this table.)
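
The Open Datasets row describes records that pair an initial state, a goal state, and a plan for Blocksworld instances. The snippet below is a minimal, hypothetical illustration of what one such record could look like; the field names and the PDDL-style encoding are assumptions, not taken from the paper or its repository.

```python
# Hypothetical example of a single Blocksworld instance record of the kind the
# paper's fine-tuning dataset is described as containing: an initial state, a
# goal state, and the respective plan. Field names and the PDDL-style predicate
# encoding are illustrative assumptions.
blocksworld_instance = {
    "initial_state": [
        "(ontable b1)", "(ontable b2)", "(on b3 b2)",
        "(clear b1)", "(clear b3)", "(handempty)",
    ],
    "goal_state": [
        "(on b1 b3)",
    ],
    "plan": [
        "(pick-up b1)",   # b1 is clear, on the table, and the hand is empty
        "(stack b1 b3)",  # b3 is clear, so b1 can be stacked on it
    ],
}
```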
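
The Dataset Splits and Experiment Setup rows report an 80-20 train-validation split of the fine-tuning data and a temperature of 0 for deterministic model outputs. The sketch below illustrates those two settings in Python, assuming the current OpenAI Python client; the file path, prompt field, and model name are illustrative assumptions rather than the authors' actual scripts.

```python
# Minimal sketch of the reported setup: an 80-20 train-validation split of the
# 1,000 Blocksworld fine-tuning instances and deterministic (temperature=0)
# querying. Paths, field names, and the model name are hypothetical.
import json
import random

from openai import OpenAI  # assumes the `openai` Python package is installed

# Load the fine-tuning instances (path is hypothetical).
with open("blocksworld_finetune_instances.json") as f:
    instances = json.load(f)

# 80-20 train-validation split, as reported in the paper.
random.seed(0)
random.shuffle(instances)
split = int(0.8 * len(instances))
train_set, val_set = instances[:split], instances[split:]

# The paper sets temperature to 0 for all models to make them deterministic.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[{"role": "user", "content": val_set[0]["prompt"]}],
)
print(response.choices[0].message.content)
```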