Memorization Capacity of Multi-Head Attention in Transformers
Authors: Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our findings through experiments on synthetic data. |
| Researcher Affiliation | Academia | Sadegh Mahdavi (1,2), Renjie Liao (1,2), Christos Thrampoulidis (1); (1) University of British Columbia, (2) Vector Institute for AI; {smahdavi,rjliao,cthrampo}@ece.ubc.ca |
| Pseudocode | No | The paper describes mathematical proofs and experimental procedures, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/smahdavi4/attention-memorization |
| Open Datasets | Yes | We evaluate the mentioned models on 2000 images sampled from ImageNet. To empirically test Assumption 2, we verify whether the context vectors are all linearly independent for each example. On the other hand, testing Assumption 1 is computationally difficult since computing the Kruskal rank is NP-hard. |
| Dataset Splits | No | The paper mentions training parameters such as epochs, optimization steps, and batch size, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for the datasets used, especially for the synthetic data. |
| Hardware Specification | Yes | Running the experiments reported in our paper takes approximately 20 GPU days on an NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions using an "Adam optimizer with a learning rate of 0.001, and a scheduler with linear warmup and Cosine decay" but does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | We train each task for at least 500 epochs and at least 50,000 optimization steps (whichever is larger) with a batch size of 256, using an Adam optimizer with a learning rate of 0.001, and a scheduler with linear warmup and cosine decay. |
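
The Open Datasets row quotes the paper's empirical check of Assumption 2: for each example, the context vectors must be linearly independent, while the Kruskal rank needed for Assumption 1 is NP-hard to compute. A minimal sketch of such an independence check is given below; it assumes the context vectors of one example are stacked as the rows of a matrix `E`, and the function name and tolerance are illustrative, not taken from the paper's released code.

```python
import numpy as np

def context_vectors_linearly_independent(E: np.ndarray, tol: float = 1e-8) -> bool:
    """Check whether the n context vectors (rows of E, shape n x d) are linearly independent.

    Linear independence holds iff E has full row rank, which is estimated
    numerically from its singular values.
    """
    n, d = E.shape
    if n > d:  # more vectors than dimensions can never be linearly independent
        return False
    singular_values = np.linalg.svd(E, compute_uv=False)  # sorted in descending order
    return int(np.sum(singular_values > tol * singular_values[0])) == n

# Illustrative usage with random context vectors (almost surely independent).
rng = np.random.default_rng(0)
E = rng.standard_normal((10, 64))  # 10 hypothetical context vectors of dimension 64
print(context_vectors_linearly_independent(E))  # True
```

Full row rank can be read off the singular values in polynomial time, whereas the Kruskal rank would require examining every subset of rows, which is consistent with the quoted remark that testing Assumption 1 is computationally difficult.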
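
The Experiment Setup row fixes the optimizer, learning rate, batch size, and schedule, but the paper's quoted text does not spell out framework-level code. Below is a minimal PyTorch-style sketch of that configuration, assuming a `LambdaLR` schedule for the linear warmup and cosine decay; the stand-in model, warmup length, and training loop are hypothetical placeholders, not the authors' implementation.

```python
import math
import torch

# Hypothetical stand-ins; the paper does not publish these names in the quoted text.
model = torch.nn.Linear(64, 10)   # placeholder for the attention model being trained
total_steps = 50_000              # "at least 50,000 optimization steps"
warmup_steps = 1_000              # warmup length is an assumption, not stated in the paper

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # "learning rate of 0.001"

def lr_lambda(step: int) -> float:
    """Linear warmup then cosine decay, as a multiplicative factor on the base LR."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One optimization step per mini-batch of 256 examples (batch size from the paper):
# for batch in loader:
#     loss = compute_loss(model, batch)   # hypothetical loss helper
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
#     scheduler.step()
```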