Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Working Memory Capacity of ChatGPT: An Empirical Study
Authors: Dongyu Gong, Xingchen Wan, Dingmin Wang
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we systematically assess the working memory capacity of ChatGPT, a large language model developed by OpenAI, by examining its performance in verbal and spatial n-back tasks under various conditions. Our experiments reveal that ChatGPT has a working memory capacity limit strikingly similar to that of humans. |
| Researcher Affiliation | Academia | Dongyu Gong (1, 2), Xingchen Wan (1), Dingmin Wang (1); (1) University of Oxford, (2) Yale University. EMAIL, EMAIL, EMAIL |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | Yes | All code for our experiments can be accessed in this repository: https://github.com/Daniel-Gong/ChatGPT-WM. |
| Open Datasets | No | The paper describes generating data for experiments ('we generated 50 blocks of letter sequences'), but does not provide access information (link, DOI, etc.) to a publicly available or open dataset. |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits. It describes generating data for '50 blocks of tests' for each experiment, but this refers to experimental runs, not a partitioned dataset. |
| Hardware Specification | No | The paper states that the LLMs were accessed through APIs ('prompted ChatGPT (using the OpenAI API, model = gpt-3.5-turbo, temperature = 1, other parameters are set to default values) to complete the tasks...') and does not specify any hardware the authors used for computation. |
| Software Dependencies | No | The paper mentions LLM models used (e.g., 'gpt-3.5-turbo', 'Bloomz-7B'), but does not provide specific version numbers for ancillary software like programming languages, libraries, or frameworks (e.g., Python 3.x, PyTorch x.x). |
| Experiment Setup | Yes | We devised two categories of n-back tasks involving verbal and spatial working memory... and prompted ChatGPT (using the OpenAI API, model = gpt-3.5-turbo, temperature = 1, other parameters are set to default values) to complete the tasks in a trial-by-trial manner. For both categories, we have a base version task and several variants derived from the base version further to test the model's performance under different conditions. For n = {1, 2, 3}, respectively, we generated 50 blocks of letter sequences... Each block contained a sequence of 24 letters, which are presented one at a time as user input to the API. We included 8 match trials and 16 nonmatch trials in each block. |
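
The block structure quoted in the Experiment Setup row (24 letters per block, 8 match trials, 16 nonmatch trials, 50 blocks for each n in {1, 2, 3}) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `make_block` and `count_matches`, the use of the full lowercase alphabet, and the per-block seeding are all assumptions made for illustration. The first n positions are counted as nonmatch trials here, which is also an assumption.

```python
import random
import string


def make_block(n, length=24, n_match=8, alphabet=string.ascii_lowercase, rng=None):
    """Build one n-back block with exactly `n_match` match trials.

    Hypothetical helper: the alphabet and sampling scheme are assumptions,
    not taken from the paper or its repository.
    """
    rng = rng or random.Random()
    # A match can only occur at positions that have a letter n steps back.
    match_positions = set(rng.sample(range(n, length), n_match))
    seq = []
    for i in range(length):
        if i in match_positions:
            # Match trial: repeat the letter shown n steps earlier.
            seq.append(seq[i - n])
        elif i >= n:
            # Nonmatch trial: pick any letter except the one n steps earlier.
            seq.append(rng.choice([c for c in alphabet if c != seq[i - n]]))
        else:
            # The first n letters have nothing to compare against.
            seq.append(rng.choice(alphabet))
    labels = [i in match_positions for i in range(length)]
    return seq, labels


def count_matches(seq, n):
    """Count positions whose letter equals the letter n steps earlier."""
    return sum(seq[i] == seq[i - n] for i in range(n, len(seq)))


# 50 blocks for each n in {1, 2, 3}, seeded per block for repeatability.
blocks = {
    n: [make_block(n, rng=random.Random(seed)) for seed in range(50)]
    for n in (1, 2, 3)
}
```

Each entry of `blocks[n]` is a `(seq, labels)` pair, where `labels[i]` is `True` for a match trial. In a trial-by-trial setup like the one described, the letters in `seq` would be sent one at a time as user input and each response compared with the corresponding label.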