Chain of Thoughtlessness? An Analysis of CoT in Planning
Authors: Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. |
| Researcher Affiliation | Academia | Kaya Stechly, SCAI, Arizona State University (kstechl@asu.edu); Karthik Valmeekam, SCAI, Arizona State University (kvalmeek@asu.edu); Subbarao Kambhampati, SCAI, Arizona State University (rao@asu.edu) |
| Pseudocode | No | The paper describes algorithmic procedures and provides examples of LLM prompts and responses, but it does not include formal structured pseudocode or algorithm blocks (e.g., a figure or section labeled 'Pseudocode' or 'Algorithm'). |
| Open Source Code | Yes | Resources and source code for planning experiments can be found at https://github.com/karthikv792/cot-planning and for other domains at https://github.com/kstechly/cot-scheduling |
| Open Datasets | Yes | We focus on Blocksworld, a simple commonsense domain widely recognized and utilized in International Planning Competitions [23]... We source our list of names from the U.S. Social Security Administration [1]... |
| Dataset Splits | No | The paper discusses testing on various problem instances and refers to LLMs' 'in-context learning' from provided examples, but it does not specify explicit training/validation/test splits of the data used for their experiments, nor does it use the term 'validation' for data splitting. |
| Hardware Specification | No | The authors state they used the OpenAI API and Anthropic API for their experiments, which implies cloud-based LLM services, but they do not specify any particular GPU, CPU, memory, or other hardware specifications of the machines running these experiments. |
| Software Dependencies | No | The paper mentions using specific LLM models (e.g., GPT-4, Claude-3-Opus) and validates plans with VAL [21], but it does not provide specific version numbers for any software libraries or dependencies used in their experimental setup. |
| Experiment Setup | Yes | We consider different chain of thought prompts... We sample 5 different reasoning paths (with temperature 0.7) and chose the most frequent plan breaking ties randomly. |
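The selection procedure quoted in the Experiment Setup row (sample several reasoning paths, keep the most frequent plan, break ties randomly) can be sketched as below. This is an illustrative reconstruction, not the authors' code; the plan strings and the `most_frequent_plan` helper are hypothetical stand-ins for the sampled LLM outputs.

```python
import random
from collections import Counter

def most_frequent_plan(plans):
    """Majority vote over sampled plans, breaking ties uniformly at random
    (mirrors the paper's stated procedure for 5 samples at temperature 0.7)."""
    counts = Counter(plans)
    top = max(counts.values())
    tied = [plan for plan, count in counts.items() if count == top]
    return random.choice(tied)

# Example: five sampled plans, represented here as action strings.
samples = [
    "unstack A B; stack A C",
    "unstack A B; stack A C",
    "pickup C; stack C A",
    "unstack A B; stack A C",
    "pickup C; stack C A",
]
print(most_frequent_plan(samples))  # "unstack A B; stack A C" (3 of 5 votes)
```

Since one plan holds a strict majority here, the random tie-break never triggers; with a 2–2–1 split, any of the two leading plans could be returned.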