Learning Associative Memories with Gradient Descent
Authors: Vivien Cabannes, Berfin Simsek, Alberto Bietti
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through theory and experiments, we provide several insights. ... We complement our analysis with experiments, investigating small multi-layer Transformer models with our associative memory viewpoint and identifying similar behaviors to those pinpointed in the simpler models. |
| Researcher Affiliation | Collaboration | ¹Meta AI, ²Flatiron. Correspondence to: <vivc@meta.com>. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository. |
| Open Datasets | No | We consider full-batch gradient descent on a dataset of 16 384 sequences of length 256 generated from the model described above with N = 64 tokens. ... The tokens following all non-trigger tokens are randomly sampled from a sequence-independent Markov model (namely, a character-level bigram model estimated from Shakespeare text data). The dataset is generated by the authors, and no specific link, DOI, or citation to a publicly available instance of this generated dataset is provided. (A hedged data-generation sketch appears after the table.) |
| Dataset Splits | No | We consider full-batch gradient descent on a dataset of 16 384 sequences of length 256 generated from the model described above with N = 64 tokens. No explicit training, validation, or test dataset splits are provided. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'pytorch convention' in Appendix A but does not list any specific software or library names with version numbers required for reproducibility. |
| Experiment Setup | Yes | We consider full-batch gradient descent on a dataset of 16 384 sequences of length 256 generated from the model described above with N = 64 tokens. ... Training losses are shown for different step-sizes η, and margins are shown for 5 different tokens. ... In Figure 6, we consider a setup with N = M = 5, f(x) = x, and p(x) ∝ 1/x, in different dimensions (with random embeddings). (A hedged training sketch for this setup appears after the table.) |
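
The Open Datasets row quotes the paper's synthetic data setup: 16 384 sequences of length 256 over N = 64 tokens, where tokens following non-trigger tokens are drawn from a sequence-independent bigram Markov model (estimated from Shakespeare text in the paper). Since no code or data is released, the following is only a minimal sketch of how such a generator could look; the trigger set, the sequence-specific output token copied after each trigger, and the random stand-in for the Shakespeare bigram matrix are assumptions, not the authors' implementation.

```python
# Hedged sketch of the synthetic data generation quoted above:
# 16,384 sequences of length 256 over N = 64 tokens. Tokens following
# non-trigger tokens come from a fixed (sequence-independent) bigram
# Markov model; the paper estimates it from Shakespeare text, while a
# random transition matrix is used here as a stand-in. The trigger
# mechanism below (a sequence-specific output token copied after each
# trigger) is an assumption about the "model described above".
import numpy as np

rng = np.random.default_rng(0)

N = 64            # vocabulary size
SEQ_LEN = 256     # sequence length
NUM_SEQ = 16_384  # dataset size
TRIGGERS = np.array([0, 1, 2])  # hypothetical set of trigger tokens

# Stand-in bigram transition matrix (rows sum to 1).
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)

def sample_sequence():
    # Assumed mechanism: each sequence fixes one output token that
    # deterministically follows every trigger token.
    out_tok = rng.integers(N)
    seq = np.empty(SEQ_LEN, dtype=np.int64)
    seq[0] = rng.integers(N)
    for t in range(1, SEQ_LEN):
        prev = seq[t - 1]
        if prev in TRIGGERS:
            seq[t] = out_tok                   # trigger -> copy output token
        else:
            seq[t] = rng.choice(N, p=P[prev])  # bigram Markov step
    return seq

data = np.stack([sample_sequence() for _ in range(NUM_SEQ)])
print(data.shape)  # (16384, 256)
```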
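
Similarly, the Experiment Setup row describes the Figure 6 configuration: N = M = 5 tokens, identity target f(x) = x, Zipf input distribution p(x) ∝ 1/x, random embeddings in several dimensions, and full-batch gradient descent with various step-sizes η. The sketch below shows one plausible reading of that setup as a linear associative memory W scored by u_yᵀ W e_x and trained on a weighted cross-entropy; the loss form, margin definition, zero initialization, and the chosen values of d, η, and the step count are illustrative assumptions rather than the paper's code.

```python
# Hedged sketch of the Figure 6 setup: N = M = 5, f(x) = x, p(x) ∝ 1/x,
# random embeddings in dimension d, full-batch gradient descent with
# step-size eta on a population cross-entropy loss. The margin of a token
# is taken as its correct score minus the best competing score, which is
# an assumption consistent with the paper's description.
import torch

torch.manual_seed(0)

N = M = 5          # number of input / output tokens
d = 16             # embedding dimension (hypothetical choice)
eta = 1.0          # step-size (hypothetical; the paper sweeps several)
steps = 1000

# Zipf law p(x) ∝ 1/x and identity target f(x) = x.
p = 1.0 / torch.arange(1, N + 1, dtype=torch.float32)
p /= p.sum()
targets = torch.arange(N)

# Random fixed embeddings / unembeddings; W initialized at zero (assumed).
E = torch.randn(N, d) / d**0.5   # input embeddings e_x
U = torch.randn(M, d) / d**0.5   # output embeddings u_y
W = torch.zeros(d, d, requires_grad=True)

for step in range(steps):
    logits = E @ W.T @ U.T       # logits[x, y] = u_y^T W e_x
    losses = torch.nn.functional.cross_entropy(logits, targets, reduction="none")
    loss = (p * losses).sum()    # population loss weighted by p(x)
    loss.backward()
    with torch.no_grad():
        W -= eta * W.grad        # full-batch gradient descent step
        W.grad.zero_()

# Margin of each token: correct score minus the best competing score.
with torch.no_grad():
    logits = E @ W.T @ U.T
    correct = logits[torch.arange(N), targets]
    logits[torch.arange(N), targets] = -float("inf")
    margins = correct - logits.max(dim=1).values
    print(loss.item(), margins)
```

Tracking `loss` and `margins` over training steps for several values of `eta` would reproduce the kind of curves the quoted setup refers to, under the stated assumptions.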