CogBench: a large language model walks into a psychology lab

Authors: Julian Coda-Forno, Marcel Binz, Jane X Wang, Eric Schulz

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 40 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior.
Researcher Affiliation | Collaboration | (1) Computational Principles of Intelligence Lab, Max Planck Institute for Biological Cybernetics, Tübingen, Germany; (2) Institute for Human-Centered AI, Helmholtz Computational Health Center, Munich, Germany; (3) Google DeepMind, London, UK.
Pseudocode | Yes |
import statsmodels.formula.api as smf
model = smf.mixedlm(f"{score} ~ {' + '.join(llm_features)}", df_standardized, groups=df_standardized['type_of_llm'])
Listing 1. Python multi-level regression code. (An expanded, runnable sketch of this regression appears below the table.)
Open Source Code | Yes | Taken together, our experiments show how psychology can offer detailed insights into artificial agents' behavior, as we provide an openly accessible and challenging benchmark to evaluate LLMs. Code: https://github.com/juliancodaforno/CogBench
Open Datasets | Yes | We obtained the human data directly from the authors for most experiments (Dasgupta et al., 2020; Wilson et al., 2014; Ershadmanesh et al., 2023; Lefebvre et al., 2017; Kool et al., 2017), except for BART and temporal discounting.
Dataset Splits | No | The paper does not explicitly mention train/test/validation splits for its experiments with LLMs. It discusses the 'training data' of the LLMs themselves, but not splits for its own experimental setup.
Hardware Specification | No | The paper does not explicitly state the hardware specifications (e.g., specific GPU models, CPU types, or cloud instances) used to run its experiments.
Software Dependencies | No | The paper mentions using the 'statsmodels.formula.api' package in Python (Appendix G) but does not specify version numbers for Python or any other libraries (e.g., PyTorch, TensorFlow, or other ML frameworks) that would be critical for reproducibility.
Experiment Setup | Yes | It is important to note that all experiments performed in this paper rely entirely on the LLMs' in-context learning abilities and do not involve any form of fine-tuning. We set the temperature parameter to zero, leading to deterministic responses, and retained the default values for all other parameters. (An illustrative query sketch appears below the table.)
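The one-line regression in Listing 1 omits the data preparation it depends on. Below is a minimal, self-contained sketch of the same multilevel-regression idea in Python with statsmodels; the column names (score, n_params_log, rlhf, type_of_llm), the predictor set, and the synthetic data are illustrative assumptions, not the authors' actual dataset or feature list.

# Minimal sketch of the multilevel regression from Listing 1.
# Column names, predictors, and the synthetic data are assumptions for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "score": rng.normal(size=n),                    # one behavioral metric per LLM run
    "n_params_log": rng.normal(size=n),             # e.g., log number of parameters
    "rlhf": rng.integers(0, 2, size=n),             # whether the model was RLHF-tuned
    "type_of_llm": rng.choice(["family_a", "family_b", "family_c"], size=n),  # base-model family
})

# Standardize continuous variables so coefficients are comparable.
for col in ["score", "n_params_log"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

score = "score"
llm_features = ["n_params_log", "rlhf"]

# Mixed-effects model with a random intercept per base-model family,
# capturing the nested dependencies among fine-tuned versions of one LLM.
model = smf.mixedlm(
    f"{score} ~ {' + '.join(llm_features)}",
    df,
    groups=df["type_of_llm"],
)
result = model.fit()
print(result.summary())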
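The experiment-setup row states that models were queried purely in context with the temperature set to zero. A hypothetical sketch of one such deterministic query is shown below using the openai Python client; the client choice, model name, and prompt are assumptions for illustration and do not reflect the authors' actual evaluation harness, which spans many LLMs across different providers.

# Hypothetical sketch of a single in-context query with temperature = 0.
# The client, model name, and prompt text are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "You are playing a multi-round bandit task. "
    "On each round, respond only with the name of the option you choose. "
    "Round 1: do you choose option F or option J?"
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic (greedy) decoding, matching the paper's setup
)
print(response.choices[0].message.content)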