Memorization Capacity of Multi-Head Attention in Transformers
Authors: Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our findings through experiments on synthetic data. |
| Researcher Affiliation | Academia | Sadegh Mahdavi (1,2), Renjie Liao (1,2), Christos Thrampoulidis (1); (1) University of British Columbia, (2) Vector Institute for AI; {smahdavi,rjliao,cthrampo}@ece.ubc.ca |
| Pseudocode | No | The paper describes mathematical proofs and experimental procedures, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/smahdavi4/attention-memorization |
| Open Datasets | Yes | We evaluate the mentioned models on 2000 images sampled from ImageNet. To empirically test Assumption 2, we verify whether the context vectors are all linearly independent for each example. On the other hand, testing Assumption 1 is computationally difficult since computing the Kruskal rank is NP-hard. |
| Dataset Splits | No | The paper mentions training parameters such as epochs, optimization steps, and batch size, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for the datasets used, especially for the synthetic data. |
| Hardware Specification | Yes | Running the experiments reported in our paper takes approximately 20 GPU days on an NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions using an "Adam optimizer with a learning rate of 0.001, and a scheduler with linear warmup and Cosine decay" but does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | We train each task for at least 500 epochs and at least 50,000 optimization steps (whichever is larger) with a batch size of 256, using an Adam optimizer with a learning rate of 0.001, and a scheduler with linear warmup and cosine decay. |
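
The Open Datasets row quotes the paper's empirical check of Assumption 2: for each example, the context vectors must be linearly independent, while the Kruskal rank needed for Assumption 1 is NP-hard to compute. A minimal sketch of such an independence check is given below; it assumes the context vectors of one example are stacked as the rows of a matrix `E`, and the function name and tolerance are illustrative, not taken from the paper's released code.

```python
import numpy as np

def context_vectors_linearly_independent(E: np.ndarray, tol: float = 1e-8) -> bool:
    """Check whether the n context vectors (rows of E, shape n x d) are linearly independent.

    Linear independence holds iff E has full row rank, which is estimated
    numerically from its singular values.
    """
    n, d = E.shape
    if n > d:  # more vectors than dimensions can never be linearly independent
        return False
    singular_values = np.linalg.svd(E, compute_uv=False)  # sorted in descending order
    return int(np.sum(singular_values > tol * singular_values[0])) == n

# Illustrative usage with random context vectors (almost surely independent).
rng = np.random.default_rng(0)
E = rng.standard_normal((10, 64))  # 10 hypothetical context vectors of dimension 64
print(context_vectors_linearly_independent(E))  # True
```

Full row rank can be read off the singular values in polynomial time, whereas the Kruskal rank would require examining every subset of rows, which is consistent with the quoted remark that testing Assumption 1 is computationally difficult.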
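
The Experiment Setup row fixes the optimizer, learning rate, batch size, and schedule, but the paper's quoted text does not spell out framework-level code. Below is a minimal PyTorch-style sketch of that configuration, assuming a `LambdaLR` schedule for the linear warmup and cosine decay; the stand-in model, warmup length, and training loop are hypothetical placeholders, not the authors' implementation.

```python
import math
import torch

# Hypothetical stand-ins; the paper does not publish these names in the quoted text.
model = torch.nn.Linear(64, 10)   # placeholder for the attention model being trained
total_steps = 50_000              # "at least 50,000 optimization steps"
warmup_steps = 1_000              # warmup length is an assumption, not stated in the paper

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # "learning rate of 0.001"

def lr_lambda(step: int) -> float:
    """Linear warmup then cosine decay, as a multiplicative factor on the base LR."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One optimization step per mini-batch of 256 examples (batch size from the paper):
# for batch in loader:
#     loss = compute_loss(model, batch)   # hypothetical loss helper
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
```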