Linking In-context Learning in Transformers to Human Episodic Memory
Authors: Ji-An Li, Corey Zhou, Marcus Benna, Marcelo G Mattar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that induction heads are behaviorally, functionally, and mechanistically similar to the contextual maintenance and retrieval (CMR) model of human episodic memory. Our analyses of LLMs pre-trained on extensive text data show that CMR-like heads often emerge in the intermediate and late layers, qualitatively mirroring human memory biases. The ablation of CMR-like heads suggests their causal role in in-context learning. |
| Researcher Affiliation | Academia | Li Ji-An, Neurosciences Graduate Program, University of California, San Diego (jil095@ucsd.edu); Corey Y Zhou, Department of Cognitive Science, University of California, San Diego (yiz329@ucsd.edu); Marcus K. Benna, Department of Neurobiology, University of California, San Diego (mbenna@ucsd.edu); Marcelo G. Mattar, Department of Psychology, New York University (marcelo.mattar@nyu.edu) |
| Pseudocode | No | The paper describes the mechanisms of Transformer models and the CMR model using text, equations, and diagrams (e.g., Figure 3 shows composition mechanisms), but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is available at https://github.com/corxyz/icl-cmr. |
| Open Datasets | Yes | We first examined the induction behaviors of attention heads in the pre-trained GPT2-small model [16] using the Transformer Lens library [25]. We also observed similar phenomena in a different set of LLMs called Pythia (Fig. 6b), a family of models with shared architecture but different sizes, as well as three well-known models (Qwen-7B [32], Mistral-7B [33], Llama3-8B [34], Fig. 6c). We then computed the resultant ICL score on the sampled texts (2000 sequences, each with at least 512 tokens) from the processed version of Google's C4 dataset [35]. *(A minimal sketch of a per-head induction-score computation is given after this table.)* |
| Dataset Splits | No | The paper discusses the training of models and their evaluation on test data, but it does not explicitly specify the use of a separate validation dataset split with percentages, sample counts, or a citation to predefined splits for its own experimental setup. |
| Hardware Specification | Yes | Table S2: Details of compute resources used to compute induction head metrics. All models were pretrained and accessible through the Transformer Lens library [25] with MIT License. The numbers in the Computing time column indicate the total number of minutes it took to compute all scores for all heads across all checkpoints where available. Columns: Transformer Model / Type of compute worker / RAM (GB) / Storage (GB) / Computing time (minutes). GPT2-small: CPU, 12.7 GB RAM, 225.8 GB storage, < 1 min; ...; Pythia-12b-deduped-v0: TPU v2, 334.6 GB RAM, 225.3 GB storage, 205 min. |
| Software Dependencies | No | The paper mentions using the 'Transformer Lens library [25]' for its analysis. However, it does not provide specific version numbers for this library or other software components like the programming language (e.g., Python) or other relevant packages, which are necessary for reproducible software dependency information. |
| Experiment Setup | Yes | In essence, we optimized the parameters (β_enc, β_rec, γ_FT, τ⁻¹) for each head to obtain a set of CMR-fitted scores that minimizes MSE. Specifically, we ablated either the top 10% CMR-like heads (i.e., top 10% heads with the smallest CMR distances) or the same number of randomly selected heads in each model. *(Sketches of the fitting and ablation steps are given after this table.)* |
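
The Open Datasets row above cites the Transformer Lens library and per-head ICL/induction scores. As a minimal sketch (not the authors' pipeline), the snippet below computes a standard per-head induction score for GPT2-small with TransformerLens: it feeds a repeated random token sequence and measures how strongly each head attends from tokens in the second repeat back to the token that followed their previous occurrence. The sequence length, batch size, seed, and top-k reporting are illustrative choices, not values taken from the paper.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT2-small
torch.manual_seed(0)

seq_len, batch = 64, 8
rand = torch.randint(100, model.cfg.d_vocab, (batch, seq_len))
bos = torch.full((batch, 1), model.tokenizer.bos_token_id)
tokens = torch.cat([bos, rand, rand], dim=1).to(model.cfg.device)  # [BOS] + seq + seq

_, cache = model.run_with_cache(tokens, return_type=None)

induction_scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    # Attention pattern: [batch, head, query_pos, key_pos]
    pattern = cache["pattern", layer]
    # Attention from each query to the key (seq_len - 1) positions earlier,
    # i.e. the token that followed the query token's previous occurrence.
    diag = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    # Average over batch and over all valid query positions (these fall
    # almost entirely within the second repeat of the sequence).
    induction_scores[layer] = diag.mean(dim=(0, -1)).cpu()

top = torch.topk(induction_scores.flatten(), k=5).indices
for idx in top:
    layer, head = divmod(idx.item(), model.cfg.n_heads)
    print(f"L{layer}H{head}: induction score = {induction_scores[layer, head]:.3f}")
```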
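
The Experiment Setup row describes fitting four CMR parameters (β_enc, β_rec, γ_FT, τ⁻¹) per head by minimizing an MSE. The sketch below shows only the optimization scaffolding under stated assumptions: `predict_profile` and `dummy_profile` are hypothetical placeholders standing in for the paper's CMR-derived attention-by-lag profile, and the use of scipy's L-BFGS-B is an illustrative choice, not the authors' optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def fit_head(head_profile, lags, predict_profile, x0, bounds):
    """Fit one head: minimize the MSE between the head's empirical
    attention-by-lag profile and the profile predicted by `predict_profile`
    (a stand-in for the CMR model). Returns (fitted_params, minimized_mse)."""
    def mse(params):
        return float(np.mean((predict_profile(params, lags) - head_profile) ** 2))
    res = minimize(mse, x0, bounds=bounds, method="L-BFGS-B")
    return res.x, res.fun

# Dummy predictor for illustration only (NOT the CMR equations): an
# exponential bump around lag +1, with params[3] acting as a decay rate.
def dummy_profile(params, lags):
    scores = np.exp(-params[3] * np.abs(lags - 1))
    return scores / scores.sum()

lags = np.arange(-10, 11)
head_profile = dummy_profile(np.array([0.4, 0.6, 0.1, 0.8]), lags)
params, dist = fit_head(
    head_profile, lags, dummy_profile,
    x0=np.array([0.5, 0.5, 0.5, 1.0]),
    bounds=[(0, 1), (0, 1), (0, 1), (1e-3, 10)],
)
print("fitted params:", params, "MSE:", dist)
```

In this framing the minimized per-head MSE gives a natural quantity for ranking heads; the paper's actual "CMR distance" is defined by its own equations and need not coincide with this toy version.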
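
The same row describes ablating the top 10% CMR-like heads versus an equal number of random heads. The sketch below zero-ablates a chosen head set with TransformerLens hooks on the per-head `z` activations and compares next-token loss before and after. The head list is a hypothetical placeholder, zero-ablation is one common scheme that may differ from the authors' procedure, and the paper evaluates the effect on the ICL score rather than raw loss.

```python
from functools import partial

import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")

# Hypothetical placeholder for the head set under ablation, e.g. the top-10%
# CMR-like heads (smallest CMR distance) or an equal number of random heads.
heads_to_ablate = [(5, 1), (5, 5), (6, 9), (7, 2), (7, 10)]

def zero_head_z(z, hook, head_indices):
    # z: [batch, pos, head, d_head]; zero the per-head output of chosen heads
    for h in head_indices:
        z[:, :, h, :] = 0.0
    return z

def loss_with_ablation(tokens, heads):
    by_layer = {}
    for layer, head in heads:
        by_layer.setdefault(layer, []).append(head)
    hooks = [
        (get_act_name("z", layer), partial(zero_head_z, head_indices=head_indices))
        for layer, head_indices in by_layer.items()
    ]
    return model.run_with_hooks(tokens, return_type="loss", fwd_hooks=hooks)

tokens = model.to_tokens("The quick brown fox jumps over the lazy dog. " * 20)
baseline = model(tokens, return_type="loss")
ablated = loss_with_ablation(tokens, heads_to_ablate)
print(f"baseline loss {baseline:.3f} -> ablated loss {ablated:.3f}")
```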