On the Planning Abilities of Large Language Models - A Critical Investigation
Authors: Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, Subbarao Kambhampati
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of 12% across the domains. |
| Researcher Affiliation | Academia | Karthik Valmeekam, School of Computing & AI, Arizona State University, Tempe (kvalmeek@asu.edu); Matthew Marquez, School of Computing & AI, Arizona State University, Tempe (mmarqu22@asu.edu); Sarath Sreedharan, Department of Computer Science, Colorado State University, Fort Collins (sarath.sreedharan@colostate.edu); Subbarao Kambhampati, School of Computing & AI, Arizona State University, Tempe (rao@asu.edu) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Link to the github repo: https://github.com/karthikv792/LLMs-Planning |
| Open Datasets | Yes | To investigate these questions in a systematic rather than anecdotal manner, we generate a suite of planning problem instances based on the kinds of domains employed in the International Planning Competition [14]. ... We prepared a dataset comprising the initial state, goal state, and the respective plan for 1,000 distinct Blocksworld instances. It's important to note that these instances were separate from our test set of 600 instances. |
| Dataset Splits | Yes | By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions software tools like GPT-4, GPT-3.5, LPG, and VAL, but does not provide specific version numbers for these or any other ancillary software components used for the experiments. |
| Experiment Setup | Yes (see the sketches below the table) | We set the temperature for all models to be 0, thereby making them deterministic. By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process. |
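
The Experiment Setup and Dataset Splits rows quote two concrete choices: temperature 0 for deterministic model outputs, and an 80-20 train-validation split for the fine-tuning data. The sketch below illustrates both using the current OpenAI Python SDK. It is not the authors' harness; the model name, function names, and the seeded shuffle are illustrative assumptions (the paper does not say how the split was drawn).

```python
# Minimal sketch of the quoted setup: deterministic (temperature = 0)
# LLM queries and an 80-20 train-validation split. NOT the authors' code;
# model name, helper names, and the seeded shuffle are assumptions.
import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_llm(prompt: str, model: str = "gpt-4") -> str:
    """Query the model with temperature 0, making responses deterministic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

def train_val_split(instances, val_fraction=0.2, seed=0):
    """Split fine-tuning instances 80-20 into train and validation sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, not from the paper
    cut = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:cut], shuffled[cut:]
```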
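
The Software Dependencies row notes that plan checking relied on VAL, with no version given. As a hedged illustration of how such a check is typically wired up, the sketch below assumes a KCL-Planning VAL build whose binary is on PATH as `Validate` and whose success output contains the string `Plan valid`; both details should be verified against the installed version.

```python
# Hedged sketch of checking an LLM-generated plan with the VAL validator.
# Assumes the VAL binary is on PATH as `Validate` and reports success by
# printing "Plan valid"; confirm both against your local VAL build.
import subprocess

def validate_plan(domain_pddl: str, problem_pddl: str, plan_file: str) -> bool:
    """Run VAL on (domain, problem, plan) and report whether the plan is valid."""
    result = subprocess.run(
        ["Validate", domain_pddl, problem_pddl, plan_file],
        capture_output=True,
        text=True,
    )
    return "Plan valid" in result.stdout
```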