Large Memory Layers with Product Keys
Authors: Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report results on large-scale experiments for transformer models equipped with a memory, followed by an ablation study that shows the impact of different memory components on the model performance and memory usage. |
| Researcher Affiliation | Collaboration | Facebook AI Research; Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6 |
| Pseudocode | No | The paper describes the memory design and key selection process in text and figures, but does not provide structured pseudocode or algorithm blocks (a sketch of the described selection step follows the table). |
| Open Source Code | Yes | We release our code for reproducibility purposes: https://github.com/facebookresearch/XLM |
| Open Datasets | Yes | We therefore evaluate the benefit of our approach on a corpus that is 30 times larger and extracted from the public Common Crawl. The training set is composed of 28 billion words (140 GB of data) extracted from about 40 million English news articles indexed by Common Crawl corpora. |
| Dataset Splits | Yes | The validation and test sets are both composed of 5000 news articles removed from the training set. |
| Hardware Specification | Yes | We implement our models with PyTorch [35], and train them on 32 Volta GPUs. |
| Software Dependencies | No | The paper mentions software like PyTorch, fastBPE, and the Moses toolkit, but does not specify their version numbers (e.g., 'We implement our models with PyTorch [35]'). |
| Experiment Setup | Yes | We train our models with the Adam optimizer [25], with a learning rate of 2.5 × 10⁻⁴, with β1 = 0.9, β2 = 0.98, following the learning rate schedule of Vaswani et al. [44]. ... In our main experiments, we use H = 4 memory heads, we select k = 32 keys per head, and use \|K\| = 512² memory slots (a configuration sketch also follows the table). |
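
Since the paper provides no pseudocode for the key selection step (the 'Pseudocode' row above), the following is a minimal sketch of the product-key selection it describes in text: the query is split into two halves, each half is scored against one of two small sub-key sets, and the top-k candidates are combined. The function name `product_key_topk` and the tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
import torch

def product_key_topk(q, subkeys1, subkeys2, k):
    """Sketch of product-key selection (a reconstruction, not the paper's code).

    q:        (d,) query vector, split into two halves
    subkeys1: (n, d/2) first sub-key set, n = sqrt(|K|)
    subkeys2: (n, d/2) second sub-key set
    Returns indices and scores of the k best keys among the n*n implicit
    product keys, found by scoring only 2*n sub-keys.
    """
    half = q.shape[0] // 2
    q1, q2 = q[:half], q[half:]

    # Score each query half against its sub-key set.
    s1 = subkeys1 @ q1                                   # (n,)
    s2 = subkeys2 @ q2                                   # (n,)

    # Top-k per half, then combine into k*k candidate product keys.
    top1, idx1 = s1.topk(k)                              # (k,)
    top2, idx2 = s2.topk(k)                              # (k,)
    cand = top1[:, None] + top2[None, :]                 # (k, k)
    best, flat = cand.reshape(-1).topk(k)                # (k,)
    i = torch.div(flat, k, rounding_mode="floor")
    j = flat % k

    n = subkeys2.shape[0]
    full_idx = idx1[i] * n + idx2[j]                     # index into the n*n slots
    return full_idx, best

# Example with the paper's reported sizes: k = 32 keys, 512 sub-keys per set.
d, n, k = 512, 512, 32
idx, scores = product_key_topk(torch.randn(d), torch.randn(n, d // 2), torch.randn(n, d // 2), k)
```

With these settings the search addresses |K| = 512² = 262,144 memory slots per head while only comparing the query against 2 × 512 sub-keys, which is the efficiency argument the paper makes for product keys.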
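
Similarly, the 'Experiment Setup' row translates into roughly the following optimizer configuration. This is a sketch of the quoted hyperparameters only: the model is a placeholder, and the 4000-step warmup is an assumption (the paper follows the schedule of Vaswani et al. [44], but the warmup length is not part of the quote above).

```python
import torch

# Hyperparameters quoted in the 'Experiment Setup' row.
LR = 2.5e-4           # learning rate
BETAS = (0.9, 0.98)   # Adam beta1, beta2
H = 4                 # memory heads
K_SELECTED = 32       # keys selected per head
N_SUBKEYS = 512       # per sub-key set, giving |K| = 512**2 memory slots

model = torch.nn.Linear(512, 512)  # placeholder for the memory-augmented transformer
optimizer = torch.optim.Adam(model.parameters(), lr=LR, betas=BETAS)

# Inverse square-root warmup schedule in the style of Vaswani et al.;
# the warmup length is an assumption, not taken from the paper.
def inv_sqrt(step, warmup=4000):
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5) * warmup ** 0.5

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt)
```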