CogBench: a large language model walks into a psychology lab
Authors: Julian Coda-Forno, Marcel Binz, Jane X Wang, Eric Schulz
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 40 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. |
| Researcher Affiliation | Collaboration | 1 Computational Principles of Intelligence Lab, Max Planck Institute for Biological Cybernetics, Tübingen, Germany; 2 Institute for Human-Centered AI, Helmholtz Computational Health Center, Munich, Germany; 3 Google DeepMind, London, UK. |
| Pseudocode | Yes | `import statsmodels.formula.api as smf; model = smf.mixedlm(f"{score} ~ {'+'.join(llm_features)}", df_standardized, groups=df_standardized['type_of_llm'])` — Listing 1: Python multi-level regression code. (A runnable sketch expanding this listing appears after the table.) |
| Open Source Code | Yes | Taken together, our experiments show how psychology can offer detailed insights into artificial agents' behavior, as we provide an openly accessible and challenging benchmark to evaluate LLMs (https://github.com/juliancodaforno/CogBench). |
| Open Datasets | Yes | We obtained the human data directly from the authors for most experiments (Dasgupta et al., 2020; Wilson et al., 2014; Ershadmanesh et al., 2023; Lefebvre et al., 2017; Kool et al., 2017) except for BART and temporal discounting. |
| Dataset Splits | No | The paper does not explicitly mention train/test/validation splits for its experiments with LLMs. It discusses 'training data' for LLMs themselves, but not for its own experimental setup. |
| Hardware Specification | No | The paper does not explicitly state the hardware specifications (e.g., specific GPU models, CPU types, or cloud instances) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'statsmodels.formula.api package in Python' in Appendix G but does not specify the version numbers for Python or any other libraries like PyTorch, TensorFlow, or specific ML frameworks that would be critical for reproducibility. |
| Experiment Setup | Yes | It is important to note that all experiments performed in this paper rely entirely on the LLMs' in-context learning abilities and do not involve any form of fine-tuning. We set the temperature parameter to zero, leading to deterministic responses, and retained the default values for all other parameters. (An illustrative query sketch follows the table.) |
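
The listing quoted in the Pseudocode row is compressed into a single table cell, so here is a minimal runnable sketch of that multilevel regression, assuming a standardized pandas DataFrame with one row per evaluated LLM. The column names (`behavioral_metric`, `n_parameters`, `rlhf`, `type_of_llm`) and the synthetic data are illustrative placeholders, not the authors' released dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the paper's standardized per-LLM data (illustrative only).
rng = np.random.default_rng(0)
n = 120
df_standardized = pd.DataFrame({
    "behavioral_metric": rng.standard_normal(n),   # one of CogBench's ten metrics
    "n_parameters": rng.standard_normal(n),        # standardized model size
    "rlhf": rng.integers(0, 2, n),                 # 1 if the model was RLHF-tuned
    "type_of_llm": rng.choice(["gpt", "llama", "claude"], n),  # base-model family
})

llm_features = ["n_parameters", "rlhf"]
score = "behavioral_metric"

# Mixed-effects regression with random intercepts per base-model family, so
# fine-tuned variants of the same LLM share a group, capturing the nested
# dependencies mentioned in the Research Type row.
model = smf.mixedlm(
    f"{score} ~ {'+'.join(llm_features)}",
    df_standardized,
    groups=df_standardized["type_of_llm"],
)
result = model.fit()
print(result.summary())
```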
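
The Experiment Setup row states that all evaluations rely on in-context learning with the temperature set to zero. As an illustration only, here is how a single deterministic query might look through the OpenAI Python client; the paper evaluates 40 LLMs across several providers, and the model name and prompt below are placeholders rather than CogBench's actual task prompts or harness.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One in-context query at temperature 0 (deterministic), with all other
# sampling parameters left at their defaults, mirroring the quoted setup.
response = client.chat.completions.create(
    model="gpt-4",  # placeholder model choice
    messages=[{"role": "user", "content": "You are playing a two-armed bandit game ..."}],
    temperature=0,
)
print(response.choices[0].message.content)
```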