Chain of Thoughtlessness? An Analysis of CoT in Planning

Authors: Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: the generality of the examples given in the prompt and the complexity of the problems queried with each prompt.
Researcher Affiliation | Academia | Kaya Stechly (SCAI, Arizona State University, kstechl@asu.edu); Karthik Valmeekam (SCAI, Arizona State University, kvalmeek@asu.edu); Subbarao Kambhampati (SCAI, Arizona State University, rao@asu.edu)
Pseudocode | No | The paper describes algorithmic procedures and provides examples of LLM prompts and responses, but it does not include formal structured pseudocode or algorithm blocks (e.g., a figure or section labeled 'Pseudocode' or 'Algorithm').
Open Source Code | Yes | Resources and source code for the planning experiments can be found at https://github.com/karthikv792/cot-planning and, for the other domains, at https://github.com/kstechly/cot-scheduling
Open Datasets | Yes | We focus on Blocksworld, a simple commonsense domain widely recognized and utilized in International Planning Competitions [23]... We source our list of names from the U.S. Social Security Administration [1]... (A sketch of a Blocksworld problem instance appears after the table.)
Dataset Splits | No | The paper discusses testing on various problem instances and refers to the LLMs' 'in-context learning' from provided examples, but it does not specify explicit training/validation/test splits of the data used in the experiments, nor does it use the term 'validation' in the data-splitting sense.
Hardware Specification | No | The authors state that they used the OpenAI API and Anthropic API for their experiments, which implies cloud-based LLM services, but they do not specify the GPU, CPU, memory, or other hardware of the machines running the experiments.
Software Dependencies | No | The paper names specific LLM models (e.g., GPT-4, Claude-3-Opus) and validates plans with VAL [21], but it does not provide version numbers for any software libraries or dependencies used in the experimental setup. (A sketch of plan validation with VAL appears after the table.)
Experiment Setup | Yes | We consider different chain of thought prompts... We sample 5 different reasoning paths (with temperature 0.7) and choose the most frequent plan, breaking ties randomly. (A sketch of this self-consistency procedure appears after the table.)
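
To make the Open Datasets row concrete: the Blocksworld instances the paper queries are classical PDDL planning problems. The generator below is a minimal sketch of what such an instance looks like; the generator itself, the block names, and the stacking probability are our illustrative assumptions, not the authors' actual data pipeline (their code is in the repositories linked above).

```python
import random

def random_stacks(blocks: list[str]) -> list[list[str]]:
    """Partition blocks into random stacks (bottom block first in each list)."""
    blocks = blocks[:]
    random.shuffle(blocks)
    stacks: list[list[str]] = []
    for b in blocks:
        # Start a new stack or place the block on top of an existing one.
        if not stacks or random.random() < 0.4:  # 0.4 is an arbitrary choice
            stacks.append([b])
        else:
            random.choice(stacks).append(b)
    return stacks

def stacks_to_facts(stacks: list[list[str]]) -> list[str]:
    """Translate stacks into the standard Blocksworld predicates."""
    facts = []
    for stack in stacks:
        facts.append(f"(ontable {stack[0]})")
        for below, above in zip(stack, stack[1:]):
            facts.append(f"(on {above} {below})")
        facts.append(f"(clear {stack[-1]})")
    return facts

def make_problem(n_blocks: int = 4, seed: int = 0) -> str:
    """Emit a random Blocksworld problem as a PDDL string."""
    random.seed(seed)
    blocks = [f"b{i}" for i in range(1, n_blocks + 1)]
    init = stacks_to_facts(random_stacks(blocks)) + ["(handempty)"]
    goal = stacks_to_facts(random_stacks(blocks))
    return (
        "(define (problem random-bw)\n"
        "  (:domain blocksworld)\n"
        f"  (:objects {' '.join(blocks)})\n"
        f"  (:init {' '.join(init)})\n"
        f"  (:goal (and {' '.join(goal)})))"
    )

print(make_problem())
```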
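For the Software Dependencies row: the paper checks LLM-produced plans with the VAL plan validator [21]. Below is a minimal sketch of that step, assuming VAL's Validate binary is installed and on PATH; the file names are hypothetical, and the exact success string can vary across VAL versions, so the stdout check is an assumption.

```python
import subprocess

def validate_plan(domain: str, problem: str, plan: str) -> bool:
    """Run VAL's Validate binary on a plan file (assumed to be on PATH).

    Validate is expected to exit with status 0 and print a line containing
    "Plan valid" when the plan achieves the goal under the domain semantics;
    treat that string as an assumption about the installed VAL version.
    """
    result = subprocess.run(
        ["Validate", domain, problem, plan],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and "Plan valid" in result.stdout

# Hypothetical usage with files produced elsewhere in the pipeline:
# ok = validate_plan("domain.pddl", "problem.pddl", "llm_plan.txt")
```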
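For the Experiment Setup row: sampling several reasoning paths at temperature 0.7 and keeping the most frequent plan is self-consistency decoding. Here is a minimal sketch, assuming the OpenAI Python client (>= 1.0) and a hypothetical extract_plan helper that canonicalizes the model's answer; the paper's actual prompt construction and plan extraction live in the linked repositories.

```python
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_plan(text: str) -> str:
    """Hypothetical: normalize the model's answer to a canonical plan string."""
    return text.strip().lower()

def self_consistent_plan(prompt: str, n_samples: int = 5) -> str:
    """Sample n reasoning paths at temperature 0.7; majority-vote on the plan."""
    plans = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        plans.append(extract_plan(response.choices[0].message.content or ""))
    counts = Counter(plans)
    top = max(counts.values())
    # Break ties among equally frequent plans uniformly at random,
    # matching the procedure described in the table row above.
    return random.choice([p for p, c in counts.items() if c == top])
```

Majority voting only changes the answer when samples disagree, which is why a nonzero temperature is required; five samples at temperature 0.7 is the specific setting the paper reports.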