Memorization Capacity of Multi-Head Attention in Transformers

Authors: Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our findings through experiments on synthetic data.
Researcher Affiliation | Academia | Sadegh Mahdavi (1,2), Renjie Liao (1,2), Christos Thrampoulidis (1); (1) University of British Columbia, (2) Vector Institute for AI; {smahdavi,rjliao,cthrampo}@ece.ubc.ca
Pseudocode | No | The paper describes mathematical proofs and experimental procedures, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/smahdavi4/attention-memorization
Open Datasets | Yes | We evaluate the mentioned models on 2000 images sampled from ImageNet. To empirically test Assumption 2, we verify whether the context vectors are all linearly independent for each example. On the other hand, testing Assumption 1 is computationally difficult since computing Kruskal rank is NP-hard. (A sketch of such a linear-independence check appears after this table.)
Dataset Splits | No | The paper mentions training parameters such as epochs, optimization steps, and batch size, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for the datasets used, especially for the synthetic data.
Hardware Specification | Yes | Running the experiments reported in our paper takes approximately 20 GPU days on an NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions using an "Adam optimizer with a learning rate of 0.001, and a scheduler with linear warmup and Cosine decay" but does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | We train each task for at least 500 epochs and at least 50,000 optimization steps (whichever is larger) with a batch size of 256, using an Adam optimizer with a learning rate of 0.001, and a scheduler with linear warmup and cosine decay. (A sketch of this setup appears after this table.)
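
The Assumption 2 check quoted in the Open Datasets row (verifying that each example's context vectors are linearly independent) can be reproduced with a simple numerical rank test. The sketch below is an illustration only, not the authors' code: the use of PyTorch, the `contexts` name, and the tensor shapes are assumptions.

```python
# Hedged sketch: empirical linear-independence check for context vectors.
# PyTorch, variable names, and shapes are assumptions; only the idea of the
# check (rank equals number of vectors) comes from the quoted response.
import torch

def contexts_linearly_independent(contexts: torch.Tensor, tol: float = 1e-6) -> bool:
    """contexts: (n, d) matrix whose rows are the n context vectors of one example.
    Returns True if the rows are linearly independent, i.e. the matrix has full row rank."""
    n, d = contexts.shape
    if n > d:
        return False  # more vectors than dimensions can never be independent
    rank = torch.linalg.matrix_rank(contexts, rtol=tol)
    return int(rank) == n

# Example: random Gaussian rows are almost surely independent when n <= d.
example = torch.randn(8, 64)
print(contexts_linearly_independent(example))  # expected: True
```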
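The Experiment Setup row can likewise be read as a training configuration. The paper states the hyperparameters (Adam, learning rate 0.001, batch size 256, linear warmup then cosine decay, at least 500 epochs and at least 50,000 steps) but not the framework or versions, so PyTorch, the warmup length, and the placeholder model below are assumptions for illustration.

```python
# Hedged sketch of the reported training configuration. Only the quoted
# hyperparameters come from the paper; framework (PyTorch), warmup length,
# and the dummy model are illustrative assumptions.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer_and_scheduler(model: torch.nn.Module, total_steps: int, warmup_steps: int = 1000):
    optimizer = Adam(model.parameters(), lr=1e-3)            # Adam, lr = 0.001
    warmup = LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps)   # linear warmup
    decay = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps)      # cosine decay
    scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[warmup_steps])
    return optimizer, scheduler

# Usage with a placeholder model; the paper trains with batch size 256 for
# max(500 epochs, 50,000 steps) per task.
dummy = torch.nn.Linear(16, 4)
opt, sched = build_optimizer_and_scheduler(dummy, total_steps=50_000)
```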