Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

Authors: Ida Momennejad, Hosein Hasanbeig, Felipe Vieira Frujeri, Hiteshi Sharma, Nebojsa Jojic, Hamid Palangi, Robert Ness, Jonathan Larson

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed for the evaluation of various abilities. Second, here we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which offer both established construct validity for evaluating planning, and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and falling in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs.
Researcher Affiliation | Industry | Ida Momennejad (Microsoft Research, New York, NY; idamo@microsoft.com); Hosein Hasanbeig (Microsoft Research, New York, NY; hosein.hasanbeig@microsoft.com); Felipe Vieira Frujeri (Microsoft, Redmond, WA; felipe.frujeri@microsoft.com); Hiteshi Sharma (Microsoft, Redmond, WA; hiteshi.sharma@microsoft.com); Robert Osazuwa Ness (Microsoft Research, Redmond, WA; robertness@microsoft.com); Nebojsa Jojic (Microsoft Research, Redmond, WA; jojic@microsoft.com); Hamid Palangi (Microsoft Research, Redmond, WA; hpalangi@microsoft.com); Jonathan Larson (Microsoft Research, Redmond, WA; jolarso@microsoft.com)
Pseudocode | No | The paper describes a protocol and tasks but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All prompts are available in the supplementary material and on https://github.com/cogeval/cogmaps.
Open Datasets | No | The paper evaluates existing LLMs using novel prompts the authors generated, rather than training a new model on a publicly available dataset with splits.
Dataset Splits | No | The paper evaluates existing LLMs and does not describe training a new model with specific train/validation/test dataset splits.
Hardware Specification | No | The paper evaluates LLMs via APIs (Azure OpenAI API, nat.dev API) but does not specify the hardware (e.g., CPU or GPU models) used for the authors' own experimental setup (running prompts, collecting data, performing statistical analysis).
Software Dependencies | No | The paper mentions using LLM APIs and logistic regression analysis but does not specify version numbers for any software dependencies, such as programming languages, libraries, or statistical packages used in the analysis.
Experiment Setup | Yes | We conducted planning experiments to systematically compare the performance of all LLMs across task conditions created with 3 factors of graph structure (6 graphs), domain (3 domains), and tasks (15 tasks) over 3 temperatures (0, 0.5, 1). LLM responses were generated 30 times per task prompt and temperature for the three OpenAI models studied in this work and once per task and temperature for other LLMs.
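The experiment setup row describes a full factorial sweep (6 graphs × 3 domains × 15 tasks × 3 temperatures, with 30 repetitions per prompt for the OpenAI models and a single run for the others). Below is a minimal sketch of how such a sweep could be scripted; the condition labels, the `build_prompt` helper, and the `query_llm` placeholder are hypothetical illustrations, not taken from the paper's released prompts or code, and the sketch assumes the three factors are fully crossed.

```python
import itertools

# Hypothetical condition labels standing in for the paper's factorial design:
# 6 graph structures x 3 domains x 15 task conditions, each run at 3 temperatures.
GRAPHS = [f"graph_{i}" for i in range(1, 7)]
DOMAINS = [f"domain_{i}" for i in range(1, 4)]
TASKS = [f"task_{i}" for i in range(1, 16)]
TEMPERATURES = [0.0, 0.5, 1.0]


def query_llm(prompt: str, temperature: float) -> str:
    """Placeholder for a chat-completion call (e.g., via the Azure OpenAI or nat.dev APIs)."""
    raise NotImplementedError


def run_sweep(build_prompt, is_openai_model: bool) -> list[dict]:
    """Collect responses for every condition: 30 repeats per prompt and temperature
    for the OpenAI models, a single run for the other LLMs."""
    n_repeats = 30 if is_openai_model else 1
    records = []
    # Assumes a full crossing of graph, domain, and task factors.
    for graph, domain, task in itertools.product(GRAPHS, DOMAINS, TASKS):
        prompt = build_prompt(graph, domain, task)  # hypothetical prompt-construction helper
        for temperature in TEMPERATURES:
            for repeat in range(n_repeats):
                records.append({
                    "graph": graph,
                    "domain": domain,
                    "task": task,
                    "temperature": temperature,
                    "repeat": repeat,
                    "response": query_llm(prompt, temperature),
                })
    return records
```

The nested loops make the repetition policy explicit: only the per-model repeat count differs between the OpenAI models and the rest, so per-condition comparisons remain aligned across models.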