Large Memory Layers with Product Keys

Authors: Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report results on large-scale experiments for transformer models equipped with a memory, followed by an ablation study that shows the impact of different memory components on the model performance and memory usage.
Researcher Affiliation | Collaboration | Facebook AI Research; Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6
Pseudocode | No | The paper describes the memory design and key-selection process in text and figures, but does not provide structured pseudocode or algorithm blocks (a hedged sketch of the key-selection step is given after this table).
Open Source Code | Yes | We release our code for reproducibility purposes. https://github.com/facebookresearch/XLM
Open Datasets | Yes | We therefore evaluate the benefit of our approach on a corpus that is 30 times larger and extracted from the public Common Crawl. The training set is composed of 28 billion words (140 GB of data) extracted from about 40 million English news articles indexed by Common Crawl corpora.
Dataset Splits | Yes | The validation and test sets are both composed of 5000 news articles removed from the training set.
Hardware Specification | Yes | We implement our models with PyTorch [35], and train them on 32 Volta GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch, fastBPE, and the Moses toolkit, but does not specify version numbers (e.g., "We implement our models with PyTorch [35]").
Experiment Setup | Yes | We train our models with the Adam optimizer [25], with a learning rate of 2.5 × 10⁻⁴, with β1 = 0.9, β2 = 0.98, following the learning rate schedule of Vaswani et al. [44]. ... In our main experiments, we use H = 4 memory heads, we select k = 32 keys per head, and use |K| = 512² memory slots. (A sketch of this training setup is given after the table.)
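
Because the paper gives no pseudocode for the memory layer, the following is a minimal sketch of the product-key top-k selection it describes, assuming a single memory head and omitting the query network and batch normalization. The function name `product_key_topk` and all tensor shapes are illustrative and not taken from the released code.

```python
# Hedged sketch of product-key nearest-neighbor selection (single head),
# based on the paper's description: split the query in two halves, rank each
# half against a small sub-key codebook, then combine the two candidate sets.
import torch
import torch.nn.functional as F

def product_key_topk(query, sub_keys_1, sub_keys_2, k=32):
    """
    query:      (batch, d)        query produced by the query network
    sub_keys_1: (n_sub, d // 2)   first half-space codebook, n_sub = sqrt(|K|)
    sub_keys_2: (n_sub, d // 2)   second half-space codebook
    Returns scores and indices of the k best keys among n_sub ** 2 product keys.
    """
    d = query.shape[-1]
    n_sub = sub_keys_1.shape[0]
    q1, q2 = query[:, : d // 2], query[:, d // 2:]

    # Scores of each query half against its codebook: (batch, n_sub)
    s1 = q1 @ sub_keys_1.t()
    s2 = q2 @ sub_keys_2.t()

    # Top-k candidates in each half-space: (batch, k)
    s1_top, i1 = s1.topk(k, dim=1)
    s2_top, i2 = s2.topk(k, dim=1)

    # Cartesian product of the two candidate sets: (batch, k, k)
    cand_scores = s1_top.unsqueeze(2) + s2_top.unsqueeze(1)
    cand_ids = i1.unsqueeze(2) * n_sub + i2.unsqueeze(1)

    # Final top-k over the k * k candidates
    scores, best = cand_scores.view(-1, k * k).topk(k, dim=1)
    indices = cand_ids.view(-1, k * k).gather(1, best)
    return scores, indices

# Usage: sparse weighted read from a large value table with |K| = n_sub ** 2 slots
batch, d, n_sub, k = 8, 512, 512, 32
values = torch.nn.EmbeddingBag(n_sub ** 2, 1024, mode="sum")
q = torch.randn(batch, d)
sk1, sk2 = torch.randn(n_sub, d // 2), torch.randn(n_sub, d // 2)
scores, indices = product_key_topk(q, sk1, sk2, k)
output = values(indices, per_sample_weights=F.softmax(scores, dim=1))  # (batch, 1024)
```

The point of the product-key structure is visible in the sketch: the two top-k searches cost O(sqrt(|K|)) each instead of scoring all |K| = 512² keys directly.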
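The reported optimization setup (Adam, peak learning rate 2.5 × 10⁻⁴, β1 = 0.9, β2 = 0.98, Vaswani et al. schedule) could be reproduced along the lines below. This is a sketch, not the released training code: the warmup length and model dimension are assumptions not stated in the excerpt, and the `Linear` module is only a stand-in for the memory-augmented transformer.

```python
# Hedged sketch of the reported optimizer and learning-rate schedule.
import torch

d_model, warmup_steps = 1024, 4000  # assumed values, not given in the excerpt

model = torch.nn.Linear(d_model, d_model)  # stand-in for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4, betas=(0.9, 0.98))

def vaswani_lr_scale(step, warmup=warmup_steps):
    # Inverse-sqrt schedule from "Attention Is All You Need",
    # rescaled so its peak (at step == warmup) equals the base lr.
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5) * warmup ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=vaswani_lr_scale)

# Training-loop skeleton with a dummy objective
for step in range(10):
    optimizer.zero_grad()
    loss = model(torch.randn(2, d_model)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```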