Attention Approximates Sparse Distributed Memory

Authors: Trenton Bricken, Cengiz Pehlevan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We test it in pre-trained GPT2 Transformer models [3] (Section 3) and simulations (Appendix B.7). We use the Query-Key Normalized Transformer variant [22] to directly show that the relationship to SDM holds well. We then use original GPT2 models to help confirm this result and make it more general. We analyze the β coefficients learnt by the Query-Key Normalization Transformer Attention variant [22].
Researcher Affiliation | Academia | Trenton Bricken, Systems, Synthetic and Quantitative Biology, Harvard University (trentonbricken@g.harvard.edu); Cengiz Pehlevan, Applied Mathematics, Harvard University (cpehlevan@seas.harvard.edu)
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code for running these experiments, other analyses, and reproducing all figures is available at https://github.com/trentbrick/attention-approximates-sdm.
Open Datasets | Yes | We test it in pre-trained GPT2 Transformer models [3] (Section 3) and simulations (Appendix B.7). We use the Query-Key Normalized Transformer variant [22] to directly show that the relationship to SDM holds well. We then use original GPT2 models to help confirm this result and make it more general. We analyze the β coefficients learnt by the Query-Key Normalization Transformer Attention variant [22]. (References [3] and [22] point to publicly recognized models and tasks.)
Dataset Splits | No | The paper mentions using pre-trained models and translation tasks but does not specify train/validation/test dataset splits for its own experiments or analysis.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments or simulations.
Software Dependencies | No | We also would like to thank the open source software contributors that helped make this research possible, including but not limited to: Numpy, Pandas, Scipy, Matplotlib, PyTorch, Hugging Face, and Anaconda.
Experiment Setup | Yes | We test it in pre-trained GPT2 Transformer models [3] (Section 3) and simulations (Appendix B.7). We test random and correlated patterns in an autoassociative retrieval task across different numbers of neurons and SDM variants (Appendix B.7). These variants include SDM implemented using simulated neurons and the Attention approximation with a fitted β.
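
The β analysis quoted in the Research Type row refers to the Query-Key Normalized Attention variant of [22], in which queries and keys are L2-normalized so that the softmax logits are cosine similarities scaled by a single learned coefficient β. The sketch below is a minimal, hypothetical illustration of that computation under our own assumptions (NumPy framing, shapes, and names are illustrative and not taken from the paper's code):

```python
# Hypothetical sketch (not the authors' code) of Query-Key Normalized attention:
# queries and keys are L2-normalized, so a single learned scalar beta sets the
# softmax temperature. It is this beta that can be compared against the value
# implied by SDM's circle-intersection decay.
import numpy as np

def qk_norm_attention(Q, K, V, beta):
    """Q: (num_queries, d); K: (num_keys, d); V: (num_keys, d_v); beta: learned scalar."""
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    logits = beta * (Qn @ Kn.T)                      # scaled cosine similarities in [-1, 1]
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values
```

Writing β out explicitly, rather than folding it into the usual 1/sqrt(d) scaling, is what makes a direct comparison between the learned coefficients and SDM-derived β values possible.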
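
The Experiment Setup row describes an autoassociative retrieval comparison between SDM implemented with simulated neurons and the Attention approximation with a fitted β (Appendix B.7). The following toy reproduction is a hedged sketch of that setup under our own assumptions (binary patterns, small dimensionality, a Monte Carlo log-linear fit of β from the circle-intersection decay); it is not the authors' code and all parameter values are illustrative.

```python
# Hypothetical toy reproduction (not the authors' code) of the autoassociative
# retrieval comparison: SDM built from simulated neurons vs. softmax Attention
# with a beta fitted to SDM's circle-intersection decay.
import numpy as np

rng = np.random.default_rng(0)
n = 64        # address / pattern dimensionality in bits
r = 30_000    # number of simulated SDM neurons
m = 10        # number of stored (autoassociative) patterns
d = 22        # Hamming radius of each neuron's activation circle

patterns = rng.integers(0, 2, size=(m, n))           # addresses == values
neuron_addresses = rng.integers(0, 2, size=(r, n))

def hamming(a, b):
    return np.count_nonzero(a != b, axis=-1)

# Write: every neuron within distance d of a pattern accumulates it (+/-1 form).
counters = np.zeros((r, n))
for p in patterns:
    active = hamming(neuron_addresses, p) <= d
    counters[active] += 2 * p - 1

def sdm_read(query):
    """Pool the counters of all neurons within distance d of the query, then threshold."""
    active = hamming(neuron_addresses, query) <= d
    return (counters[active].sum(axis=0) > 0).astype(int)

# Fit beta: the number of neurons shared by two activation circles decays roughly
# exponentially with the Hamming distance between their centres, so a log-linear
# fit of the (Monte Carlo) intersection sizes gives the decay rate.
probe = np.zeros(n, dtype=int)
in_probe_circle = hamming(neuron_addresses, probe) <= d
dists = np.arange(0, n // 2 + 1)
intersections = []
for dv in dists:
    other = probe.copy()
    other[:dv] = 1                                    # a centre at Hamming distance dv
    in_other_circle = hamming(neuron_addresses, other) <= d
    intersections.append(np.count_nonzero(in_probe_circle & in_other_circle))
intersections = np.array(intersections, dtype=float)
valid = intersections > 0
slope = np.polyfit(dists[valid], np.log(intersections[valid]), 1)[0]
# For +/-1 patterns, cosine similarity = 1 - 2 * hamming / n, so an exponential
# decay in Hamming distance rescales to a softmax coefficient beta = -slope * n / 2.
beta = -slope * n / 2

def attention_read(query):
    """Softmax Attention over the stored patterns using the fitted beta."""
    q, K = 2 * query - 1, 2 * patterns - 1            # map bits to +/-1
    sims = (K @ q) / n                                # cosine-like similarity in [-1, 1]
    w = np.exp(beta * sims)
    w /= w.sum()
    return ((w @ K) > 0).astype(int)

# Query with a noisy copy of a stored pattern and compare the two read-outs.
target = patterns[0]
noisy = target.copy()
noisy[rng.choice(n, size=5, replace=False)] ^= 1
print("fitted beta:", round(beta, 2))
print("SDM bit errors:      ", hamming(sdm_read(noisy), target))
print("Attention bit errors:", hamming(attention_read(noisy), target))
```

The key design choice is that β is not tuned for retrieval accuracy: it is read off from the circle-intersection decay of the simulated neurons, which is the correspondence that lets softmax Attention stand in for the SDM read operation.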