reproducibilityindex.ai

CogBench: a large language model walks into a psychology lab

Authors: Julian Coda-Forno, Marcel Binz, Jane X Wang, Eric Schulz

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This paper introduces CogBench, a benchmark that includes ten behavioral metrics derived from seven cognitive psychology experiments. This novel approach offers a toolkit for phenotyping LLMs' behavior. We apply CogBench to 40 LLMs, yielding a rich and diverse dataset. We analyze this data using statistical multilevel modeling techniques, accounting for the nested dependencies among fine-tuned versions of specific LLMs. Our study highlights the crucial role of model size and reinforcement learning from human feedback (RLHF) in improving performance and aligning with human behavior.
Researcher Affiliation	Collaboration	(1) Computational Principles of Intelligence Lab, Max Planck Institute for Biological Cybernetics, Tübingen, Germany; (2) Institute for Human-Centered AI, Helmholtz Computational Health Center, Munich, Germany; (3) Google DeepMind, London, UK.
Pseudocode	Yes	import statsmodels.formula.api as smf; model = smf.mixedlm(f"{score} ~ {'+'.join(llm_features)}", df_standardized, groups=df_standardized['type_of_llm']) (Listing 1. Python multi-level regression code)
Open Source Code	Yes	Taken together, our experiments show how psychology can offer detailed insights into artificial agents' behavior as we provide an openly accessible [1] and challenging benchmark to evaluate LLMs. [1] https://github.com/juliancodaforno/CogBench
Open Datasets	Yes	We obtained the human data directly from the authors for most experiments (Dasgupta et al., 2020; Wilson et al., 2014; Ershadmanesh et al., 2023; Lefebvre et al., 2017; Kool et al., 2017) except for BART and temporal discounting.
Dataset Splits	No	The paper does not explicitly mention train/test/validation splits for its experiments with LLMs. It discusses 'training data' for LLMs themselves, but not for its own experimental setup.
Hardware Specification	No	The paper does not explicitly state the hardware specifications (e.g., specific GPU models, CPU types, or cloud instances) used for running its experiments.
Software Dependencies	No	The paper mentions using 'statsmodels.formula.api package in Python' in Appendix G but does not specify the version numbers for Python or any other libraries like PyTorch, TensorFlow, or specific ML frameworks that would be critical for reproducibility.
Experiment Setup	Yes	It is important to note that all experiments performed in this paper rely entirely on the LLMs' in-context learning abilities and do not involve any form of fine-tuning. We set the temperature parameter to zero, leading to deterministic responses, and retained the default values for all other parameters.
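
The Pseudocode row quotes only the model-construction call. A minimal runnable sketch of that kind of multilevel regression follows, fitted on synthetic data. The feature names (log_params, uses_rlhf), the z-scoring step, and the helper functions are illustrative assumptions, not taken from the paper; only the smf.mixedlm call, the formula pattern, and the type_of_llm grouping column come from the quoted listing.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def build_formula(score, llm_features):
    """Build a patsy formula such as 'score ~ a + b' (hypothetical helper)."""
    return f"{score} ~ {' + '.join(llm_features)}"


def fit_multilevel(df, score, llm_features, group_col="type_of_llm"):
    """Fit a random-intercept model with one intercept per LLM family.

    Z-scoring the numeric columns is an assumption suggested by the
    variable name df_standardized in the quoted listing.
    """
    cols = [score] + list(llm_features)
    df_standardized = df.copy()
    df_standardized[cols] = (df[cols] - df[cols].mean()) / df[cols].std()
    model = smf.mixedlm(
        build_formula(score, llm_features),
        df_standardized,
        groups=df_standardized[group_col],
    )
    return model.fit()


# Synthetic data: 8 hypothetical model families with 5 fine-tuned variants each.
rng = np.random.default_rng(0)
n_families, n_per_family = 8, 5
n = n_families * n_per_family
family = np.repeat([f"family_{i}" for i in range(n_families)], n_per_family)
family_offset = np.repeat(rng.normal(0.0, 0.5, n_families), n_per_family)
log_params = rng.normal(size=n)
uses_rlhf = rng.integers(0, 2, size=n).astype(float)
score = 0.8 * log_params + 0.4 * uses_rlhf + family_offset + rng.normal(0.0, 0.3, n)

df = pd.DataFrame(
    {
        "score": score,
        "log_params": log_params,
        "uses_rlhf": uses_rlhf,
        "type_of_llm": family,
    }
)

result = fit_multilevel(df, "score", ["log_params", "uses_rlhf"])
print(result.params)  # fixed-effect coefficients plus the group variance term
```

Grouping by type_of_llm gives each family its own random intercept, which is one way to account for the nested dependence among fine-tuned versions of the same base model that the Research Type row describes.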