Recurrent Memory Transformer
Authors: Aydar Bulatov, Yury Kuratov, Mikhail Burtsev
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We designed our experiments to evaluate the ability of the Recurrent Memory Transformer to preserve long-term dependencies across multiple input segments. The first set of experiments covers copy, reverse, associative retrieval, and quadratic-equations tasks. The second addresses language modeling: word-level on WikiText-103 (Merity et al., 2017) and character-level on enwik8 (Mahoney, 2006). We compare the Recurrent Memory Transformer with Transformer and Transformer-XL models. (A segment-recurrence sketch follows the table.) |
| Researcher Affiliation | Academia | Aydar Bulatov (1), bulatov.as@phystech.edu; Yuri Kuratov (1,2), yurii.kuratov@phystech.edu; Mikhail S. Burtsev (1,2), burtcev.ms@mipt.ru. (1) Neural Networks and Deep Learning Lab, Moscow Institute of Physics and Technology, Dolgoprudny, Russia; (2) AIRI, Moscow, Russia |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | RMT code and experiments are available at https://github.com/booydar/LM-RMT. The code, raw experiment results, and hyperparameters are provided in the supplementary materials and on GitHub. |
| Open Datasets | Yes | We use two standard benchmarks for language modeling: WikiText-103 and enwik8. WikiText-103 (Merity et al., 2017) is used for word-level language modeling and contains 103M words from English Wikipedia articles. Enwik8 (Mahoney, 2006) is used for character-level language modeling and consists of the first 10^8 bytes of an XML text dump of English Wikipedia. (A hedged data-loading sketch follows the table.) |
| Dataset Splits | Yes | Train/valid/test split as in Beltagy et al. (2020); the metric is F1. |
| Hardware Specification | Yes | We used different GPUs depending on the task: 1080Ti, V100, A100. We provide this information in Appendix A for each task. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and Hugging Face Transformers but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Language modeling experiments follow the same model and training hyperparameters as Transformer-XL. WikiText-103 experiments use 16-layer Transformers (10 heads, hidden size 410, feed-forward size 2100); enwik8 experiments use 12-layer Transformers (8 heads, hidden size 512, feed-forward size 2048). We used the Adam optimizer (Kingma and Ba, 2015) with a linearly scheduled learning rate starting from 0.00025, trained for 200,000 steps on WikiText-103 and 400,000 steps on enwik8. (An optimizer/schedule sketch follows the table.) |
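
The mechanism these experiments probe, segment-level recurrence through memory tokens, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the class name `RecurrentMemorySketch`, the backbone interface, and the choice to prepend read-memory and append write-memory tokens to every segment are illustrative assumptions.

```python
import torch
from torch import nn

class RecurrentMemorySketch(nn.Module):
    """Hedged sketch of segment-level recurrence with memory tokens."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_mem: int = 10):
        super().__init__()
        self.backbone = backbone  # any module mapping [B, T, H] -> [B, T, H]
        self.memory = nn.Parameter(torch.randn(num_mem, hidden_size) * 0.02)
        self.num_mem = num_mem

    def forward(self, segment_embeds):
        """segment_embeds: list of [batch, seg_len, hidden] tensors, one per segment."""
        batch = segment_embeds[0].size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)  # initial memory state
        outputs = []
        for seg in segment_embeds:
            # Surround the segment with memory tokens: read memory in front,
            # write memory at the end (an assumption about placement).
            x = torch.cat([mem, seg, mem], dim=1)
            h = self.backbone(x)
            # Hidden states of the trailing memory tokens carry information
            # to the next segment, giving recurrence across segments.
            mem = h[:, -self.num_mem:, :]
            outputs.append(h[:, self.num_mem:-self.num_mem, :])
        return outputs

# Usage with a small Transformer encoder as the backbone:
layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
rmt = RecurrentMemorySketch(nn.TransformerEncoder(layer, num_layers=2), hidden_size=128)
segments = [torch.randn(4, 64, 128) for _ in range(3)]  # a long input split into 3 segments
outputs = rmt(segments)
```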
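
Both benchmarks are public. The paper does not prescribe a download path, so the following is only a convenience sketch: the `wikitext`/`wikitext-103-v1` identifiers refer to the standard Hugging Face Hub dataset, and enwik8 is distributed as a zip of the first 10^8 bytes of an English Wikipedia XML dump.

```python
# Hedged convenience sketch, not the authors' data pipeline.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-v1")  # word-level LM, ~103M words
print({split: len(wikitext[split]) for split in wikitext})  # train / validation / test

# enwik8 (character-level LM) can be fetched directly, e.g.:
#   curl -O http://mattmahoney.net/dc/enwik8.zip && unzip enwik8.zip
```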
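
The reported optimization recipe translates into a short PyTorch sketch. The helper name `make_optimizer_and_schedule` is hypothetical, and warmup is not specified in the excerpt above, so the schedule below decays linearly from step 0 as an assumption.

```python
import torch

def make_optimizer_and_schedule(model: torch.nn.Module, total_steps: int = 200_000):
    # Adam with lr 2.5e-4, decayed linearly to zero over total_steps
    # (200k for WikiText-103, 400k for enwik8, as reported).
    optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
    )
    return optimizer, scheduler
```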