reproducibilityindex.ai

Working Memory Capacity of ChatGPT: An Empirical Study

Authors: Dongyu Gong, Xingchen Wan, Dingmin Wang

AAAI 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans.
Researcher Affiliation	Academia	Dongyu Gong (1,2), Xingchen Wan (1), Dingmin Wang (1). (1) University of Oxford; (2) Yale University. dongyu.gong@yale.edu, xwan@robots.ox.ac.uk, dingmin.wang@cs.ox.ac.uk
Pseudocode	No	No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code	Yes	All code for our experiments can be accessed in this repository: https://github.com/Daniel-Gong/ChatGPT-WM.
Open Datasets	No	The paper describes generating data for experiments ('we generated 50 blocks of letter sequences'), but does not provide access information (link, DOI, etc.) to a publicly available or open dataset.
Dataset Splits	No	The paper does not provide specific training/test/validation dataset splits. It describes generating data for '50 blocks of tests' for each experiment, but this refers to experimental runs, not a partitioned dataset.
Hardware Specification	No	The paper accesses the models through an API ('prompted ChatGPT (using the OpenAI API, model = gpt-3.5-turbo, temperature = 1, other parameters are set to default values) to complete the tasks...') and does not specify any hardware used for computation.
Software Dependencies	No	The paper names the models used (e.g., 'gpt-3.5-turbo', 'Bloomz-7B') but does not provide version numbers for ancillary software such as programming languages, libraries, or frameworks (e.g., Python 3.x, PyTorch x.x).
Experiment Setup	Yes	We devised two categories of n-back tasks involving verbal and spatial working memory... and prompted ChatGPT (using the OpenAI API, model = gpt-3.5-turbo, temperature = 1, other parameters are set to default values) to complete the tasks in a trial-by-trial manner. For both categories, we have a base version task and several variants derived from the base version further to test the model's performance under different conditions. For n = {1, 2, 3}, respectively, we generated 50 blocks of letter sequences... Each block contained a sequence of 24 letters, which are presented one at a time as user input to the API. We included 8 match trials and 16 nonmatch trials in each block.
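
The block structure quoted in the Experiment Setup row (n in {1, 2, 3}, 50 blocks per n, 24 letters per block, 8 match and 16 nonmatch trials) can be illustrated with a minimal Python sketch. This is not the authors' code from the repository. The function names, the use of the full lowercase alphabet, and the seeding scheme are assumptions made for illustration only, since the quoted text does not specify them.

```python
import random
import string


def generate_block(n, length=24, n_matches=8,
                   alphabet=string.ascii_lowercase, rng=None):
    """Build one n-back block: a letter sequence plus per-trial match labels.

    Hypothetical helper; the alphabet is an assumption, not taken from the paper.
    """
    rng = rng or random.Random()
    # The first n trials have no letter n steps back, so they cannot be matches.
    match_positions = set(rng.sample(range(n, length), n_matches))
    letters = []
    for i in range(length):
        if i in match_positions:
            # Match trial: repeat the letter shown n trials earlier.
            letters.append(letters[i - n])
        elif i >= n:
            # Nonmatch trial: any letter except the one n trials earlier.
            candidates = [c for c in alphabet if c != letters[i - n]]
            letters.append(rng.choice(candidates))
        else:
            # Too early in the block to compare; any letter is a nonmatch.
            letters.append(rng.choice(alphabet))
    is_match = [i in match_positions for i in range(length)]
    return letters, is_match


def generate_experiment(n_values=(1, 2, 3), n_blocks=50, seed=0):
    """Generate n_blocks blocks for each n, using one seeded generator."""
    rng = random.Random(seed)
    return {n: [generate_block(n, rng=rng) for _ in range(n_blocks)]
            for n in n_values}
```

Each letter of a block would then be sent one at a time as a user message to the API, with the match labels kept aside for scoring. Fixing the seed is a design choice made here so that the generated blocks can be regenerated exactly.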