Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

On the Planning Abilities of Large Language Models - A Critical Investigation

Authors: Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, Subbarao Kambhampati

NeurIPS 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of 12% across the domains."
Researcher Affiliation | Academia | "Karthik Valmeekam, School of Computing & AI, Arizona State University, Tempe (EMAIL); Matthew Marquez, School of Computing & AI, Arizona State University, Tempe (EMAIL); Sarath Sreedharan, Department of Computer Science, Colorado State University, Fort Collins (EMAIL); Subbarao Kambhampati, School of Computing & AI, Arizona State University, Tempe (EMAIL)"
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code | Yes | "Link to the github repo: https://github.com/karthikv792/LLMs-Planning"
Open Datasets | Yes | "To investigate these questions in a systematic rather than anecdotal manner, we generate a suite of planning problem instances based on the kinds of domains employed in the International Planning Competition [14]. ... We prepared a dataset comprising the initial state, goal state, and the respective plan for 1,000 distinct Blocksworld instances. It's important to note that these instances were separate from our test set of 600 instances."
Dataset Splits | Yes | "By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process."
Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed machine specifications) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions software tools such as GPT-4, GPT-3.5, LPG, and VAL, but does not provide specific version numbers for these or any other ancillary software components used in the experiments.
Experiment Setup | Yes | "We set the temperature for all models to be 0, thereby making them deterministic. By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process."
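The experiment setup quoted above combines deterministic decoding (temperature 0) with an 80-20 train-validation split over the 1,000 Blocksworld instances. The paper does not publish the exact splitting code, so the sketch below is an illustrative assumption: `split_instances`, the fixed seed, and the instance format are hypothetical, not the authors' implementation.

```python
import random

def split_instances(instances, train_frac=0.8, seed=0):
    """Shuffle and split instances into train/validation sets (80-20 by default)."""
    rng = random.Random(seed)  # fixed seed is illustrative, not from the paper
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-ins for the 1,000 Blocksworld instances, each holding
# an initial state, goal state, and plan as described in the paper.
instances = [{"init": f"i{n}", "goal": f"g{n}", "plan": f"p{n}"} for n in range(1000)]
train, val = split_instances(instances)  # 800 train, 200 validation

# Decoding configuration quoted from the paper: temperature 0 for determinism.
generation_config = {"temperature": 0}
```

With 1,000 instances this yields 800 training and 200 validation examples; any disjoint 80-20 partition would match the description in the paper equally well.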