Attention Approximates Sparse Distributed Memory
Authors: Trenton Bricken, Cengiz Pehlevan
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We confirm that these conditions are satisfied in pre-trained GPT2 Transformer models. We test it in pre-trained GPT2 Transformer models [3] (Section 3) and in simulations (Appendix B.7). We use the Query-Key Normalized Transformer variant [22] to show directly that the relationship to SDM holds well, and then use the original GPT2 models to confirm this result and make it more general. We analyze the β coefficients learnt by the Query-Key Normalization Transformer Attention variant [22] (this attention variant is illustrated in the first sketch below the table). |
| Researcher Affiliation | Academia | Trenton Bricken Systems, Synthetic and Quantitative Biology Harvard University trentonbricken@g.harvard.edu Cengiz Pehlevan Applied Mathematics Harvard University cpehlevan@seas.harvard.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The code for running these experiments, other analyses, and reproducing all figures is available at https://github.com/trentbrick/attention-approximates-sdm. |
| Open Datasets | Yes | We test it in pre-trained GPT2 Transformer models [3] (Section 3) and simulations (Appendix B.7). We use the Query-Key Normalized Transformer variant [22] to show directly that the relationship to SDM holds well, and then use the original GPT2 models to confirm this result and make it more general. (References [3] and [22] point to publicly available pre-trained models and standard tasks; a generic loading sketch for the GPT2 checkpoints appears below the table.) |
| Dataset Splits | No | The paper mentions using pre-trained models and translation tasks but does not specify train/validation/test dataset splits for its own experiments or analysis. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments or simulations. |
| Software Dependencies | No | The paper names its open source software only in the acknowledgments, without version information: "We also would like to thank the open source software contributors that helped make this research possible, including but not limited to: Numpy, Pandas, Scipy, Matplotlib, PyTorch, Hugging Face, and Anaconda." |
| Experiment Setup | Yes | We test it in pre-trained GPT2 Transformer models [3] (Section 3) and simulations (Appendix B.7). We test random and correlated patterns in an autoassociative retrieval task across different numbers of neurons and SDM variants (Appendix B.7). These variants include SDM implemented using simulated neurons and the Attention approximation with a fitted β (an illustrative reconstruction of the retrieval task appears below). |
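
Several rows above refer to the Query-Key Normalized Attention variant [22] and its learnt β coefficient. The following is a minimal sketch, not the authors' implementation: queries and keys are ℓ2-normalized so that their dot products are cosine similarities, and a single learnable scalar β replaces the usual 1/√d temperature (the initial value of 15.0 here is an arbitrary assumption). The paper's claim is that this softmax over β-scaled cosine similarities closely approximates the read operation of Sparse Distributed Memory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Minimal single-head attention with Query-Key Normalization.

    Queries and keys are L2-normalized, so scores are cosine similarities;
    a learnable scalar beta sets the softmax temperature (the coefficient
    the paper relates to SDM's read radius).
    """

    def __init__(self, d_model: int, beta_init: float = 15.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.beta = nn.Parameter(torch.tensor(beta_init))  # beta_init is an assumed value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = F.normalize(self.q_proj(x), dim=-1)        # unit-norm queries
        k = F.normalize(self.k_proj(x), dim=-1)        # unit-norm keys
        v = self.v_proj(x)
        scores = self.beta * q @ k.transpose(-2, -1)   # beta * cosine similarity
        return F.softmax(scores, dim=-1) @ v

x = torch.randn(2, 10, 64)
out = QKNormAttention(64)(x)
print(out.shape)  # torch.Size([2, 10, 64])
```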
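
The Experiment Setup row describes an autoassociative retrieval task comparing simulated-neuron SDM with its Attention approximation using a fitted β. The snippet below reconstructs only the Attention side of that comparison, and the pattern dimension, noise level, and β value are illustrative assumptions rather than the paper's settings: stored patterns act as both keys and values, and a noisy query is cleaned up by a β-weighted softmax over cosine similarities.

```python
import numpy as np

rng = np.random.default_rng(0)

n_vec, n_patterns, beta = 64, 100, 10.0                        # assumed values
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_vec))   # bipolar stored patterns

def attention_sdm_read(query, memory, beta):
    """Autoassociative read: keys == values == stored patterns.

    Weights each stored pattern by softmax(beta * cosine similarity),
    i.e. the Attention approximation to SDM's exponential read weighting.
    """
    q = query / np.linalg.norm(query)
    k = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    scores = beta * (k @ q)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

# Corrupt one stored pattern by flipping 20% of its entries, then read it back.
target = patterns[0].copy()
flip = rng.random(n_vec) < 0.2
noisy = np.where(flip, -target, target)

retrieved = np.sign(attention_sdm_read(noisy, patterns, beta))
print("bits recovered:", int((retrieved == target).sum()), "/", n_vec)
```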
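
For the pre-trained GPT2 analysis [3], the publicly released checkpoints can be loaded through Hugging Face, one of the acknowledged dependencies. The following is a generic loading sketch, not the authors' analysis script, showing how to obtain the per-layer attention weights such an analysis would inspect.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

# Load the publicly available pre-trained GPT2 checkpoint and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

inputs = tokenizer("Attention approximates sparse distributed memory.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One (batch, heads, seq, seq) attention tensor per layer.
print(len(outputs.attentions), outputs.attentions[0].shape)
```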