Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
CogBench: a large language model walks into a psychology lab
Authors: Julian Coda-Forno, Marcel Binz, Jane X Wang, Eric Schulz
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 40 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior. |
| Researcher Affiliation | Collaboration | ¹Computational Principles of Intelligence Lab, Max Planck Institute for Biological Cybernetics, Tübingen, Germany ²Institute for Human-Centered AI, Helmholtz Computational Health Center, Munich, Germany ³Google DeepMind, London, UK. |
| Pseudocode | Yes | `import statsmodels.formula.api as smf` `model = smf.mixedlm(f"{score} ~ {'+'.join(llm_features)}", df_standardized, groups=df_standardized['type_of_llm'])` Listing 1. Python multi-level regression code |
| Open Source Code | Yes | Taken together, our experiments show how psychology can offer detailed insights into artificial agents' behavior as we provide an openly accessible¹ and challenging benchmark to evaluate LLMs. ¹https://github.com/juliancodaforno/CogBench |
| Open Datasets | Yes | We obtained the human data directly from the authors for most experiments (Dasgupta et al., 2020; Wilson et al., 2014; Ershadmanesh et al., 2023; Lefebvre et al., 2017; Kool et al., 2017) except for BART and temporal discounting. |
| Dataset Splits | No | The paper does not explicitly mention train/test/validation splits for its experiments with LLMs. It discusses 'training data' for LLMs themselves, but not for its own experimental setup. |
| Hardware Specification | No | The paper does not explicitly state the hardware specifications (e.g., specific GPU models, CPU types, or cloud instances) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'statsmodels.formula.api package in Python' in Appendix G but does not specify the version numbers for Python or any other libraries like PyTorch, TensorFlow, or specific ML frameworks that would be critical for reproducibility. |
| Experiment Setup | Yes | It is important to note that all experiments performed in this paper rely entirely on the LLMs' in-context learning abilities and do not involve any form of fine-tuning. We set the temperature parameter to zero, leading to deterministic responses², and retained the default values for all other parameters. |
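
The multi-level regression quoted in the Pseudocode row can be sketched as a small, self-contained helper. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `score`, `llm_features`, `df_standardized`, and `type_of_llm` come from the quoted listing, while the helper functions `build_formula` and `fit_multilevel` and the argument `group_col` are hypothetical names introduced here for illustration. It assumes a pandas DataFrame with already standardized columns and that the `~` separator belongs between the response and the predictors in the formula.

```python
from typing import Sequence


def build_formula(score: str, llm_features: Sequence[str]) -> str:
    """Build a formula string such as 'score ~ feat_a + feat_b'."""
    if not llm_features:
        raise ValueError("at least one predictor is required")
    return f"{score} ~ {' + '.join(llm_features)}"


def fit_multilevel(df_standardized, score, llm_features, group_col="type_of_llm"):
    """Fit a mixed-effects model with a random intercept per group.

    Hypothetical helper: df_standardized is assumed to be a pandas DataFrame
    containing the score column, the predictor columns, and group_col.
    """
    # Imported lazily so the formula helper works without statsmodels installed.
    import statsmodels.formula.api as smf

    model = smf.mixedlm(
        build_formula(score, llm_features),
        df_standardized,
        groups=df_standardized[group_col],
    )
    return model.fit()
```

Grouping by `type_of_llm` is what lets the model treat fine-tuned variants of the same base LLM as nested rather than independent observations. Calling `fit_multilevel(df_standardized, score, llm_features).summary()` would then print the fitted coefficients for whichever columns are passed in.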