reproducibilityindex.ai

On the Planning Abilities of Large Language Models - A Critical Investigation

Authors: Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, Subbarao Kambhampati

NeurIPS 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of 12% across the domains.
Researcher Affiliation	Academia	Karthik Valmeekam, School of Computing & AI, Arizona State University, Tempe (kvalmeek@asu.edu); Matthew Marquez, School of Computing & AI, Arizona State University, Tempe (mmarqu22@asu.edu); Sarath Sreedharan, Department of Computer Science, Colorado State University, Fort Collins (sarath.sreedharan@colostate.edu); Subbarao Kambhampati, School of Computing & AI, Arizona State University, Tempe (rao@asu.edu)
Pseudocode	No	No structured pseudocode or algorithm blocks were found in the paper.
Open Source Code	Yes	Link to the GitHub repo: https://github.com/karthikv792/LLMs-Planning
Open Datasets	Yes	To investigate these questions in a systematic rather than anecdotal manner, we generate a suite of planning problem instances based on the kinds of domains employed in the International Planning Competition [14]. ... We prepared a dataset comprising the initial state, goal state, and the respective plan for 1,000 distinct Blocksworld instances. It's important to note that these instances were separate from our test set of 600 instances.
Dataset Splits	Yes	By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process.
Hardware Specification	No	No specific hardware details (e.g., exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments were provided in the paper.
Software Dependencies	No	The paper mentions software tools like GPT-4, GPT-3.5, LPG, and VAL, but does not provide specific version numbers for these or any other ancillary software components used for the experiments.
Experiment Setup	Yes	We set the temperature for all models to be 0, thereby making them deterministic. By using the default hyperparameters provided by OpenAI and an 80-20 train-validation data split, we carried out the fine-tuning process.
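
The 80-20 train-validation split quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It assumes the 1,000 fine-tuning instances are held in a list and shuffled with a fixed seed before splitting. The paper excerpt does not say how the split was drawn, so the shuffle, the seed, the function name `split_train_validation`, and the placeholder instance records are all hypothetical and are not taken from the paper or its repository.

```python
import random


def split_train_validation(instances, train_fraction=0.8, seed=0):
    """Shuffle a copy of `instances` and split it into (train, validation).

    Hypothetical helper: the shuffling strategy and seed are assumptions,
    not details reported in the paper.
    """
    shuffled = list(instances)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the 1,000 Blocksworld instances
# (the real records hold an initial state, a goal state, and a plan).
instances = [{"id": i} for i in range(1000)]
train, validation = split_train_validation(instances)
print(len(train), len(validation))
```

With 1,000 instances, an 80-20 split yields 800 training and 200 validation examples.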