Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
On the Planning Abilities of Large Language Models - A Critical Investigation
Authors: Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, Subbarao Kambhampati
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of 12% across the domains. |
| Researcher Affiliation | Academia | Karthik Valmeekam School of Computing & AI Arizona State University, Tempe. EMAIL Matthew Marquez School of Computing & AI Arizona State University, Tempe. EMAIL Sarath Sreedharan Department of Computer Science, Colorado State University, Fort Collins. EMAIL Subbarao Kambhampati School of Computing & AI Arizona State University, Tempe. EMAIL |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Link to the GitHub repo: https://github.com/karthikv792/LLMs-Planning |
| Open Datasets | Yes | To investigate these questions in a systematic rather than anecdotal manner, we generate a suite of planning problem instances based on the kinds of domains employed in the International Planning Competition [14]. ... We prepared a dataset comprising the initial state, goal state, and the respective plan for 1,000 distinct Blocksworld instances. It's important to note that these instances were separate from our test set of 600 instances. |
| Dataset Splits | Yes | By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions software tools like GPT-4, GPT-3.5, LPG, and VAL, but does not provide specific version numbers for these or any other ancillary software components used for the experiments. |
| Experiment Setup | Yes | We set the temperature for all models to be 0, thereby making them deterministic. By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process. |
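
The 80-20 train-validation split quoted in the table can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `split_train_validation`, the fixed seed, the shuffling step, and the placeholder instance records are all assumptions made for illustration. Only the 80-20 ratio and the figure of 1,000 Blocksworld instances come from the quoted text.

```python
import random


def split_train_validation(instances, train_fraction=0.8, seed=0):
    """Shuffle a copy of the instances and split it into train/validation parts.

    Hypothetical helper: the paper states only that an 80-20 split was used,
    not how it was produced.
    """
    shuffled = list(instances)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the 1,000 fine-tuning instances
# (initial state, goal state, and plan are omitted here).
instances = [{"id": i} for i in range(1000)]
train, validation = split_train_validation(instances)
print(len(train), len(validation))
```

Fixing the seed is a design choice in this sketch so that the same instances land in the validation set on every run; the source does not say whether the authors did the same.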