Recurrent Memory Transformer
Authors: Aydar Bulatov, Yury Kuratov, Mikhail Burtsev
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We designed our experiments to evaluate the ability of the Recurrent Memory Transformer to preserve long-term dependencies across multiple input segments. The first set of experiments covers copy, reverse, associative retrieval, and quadratic-equations tasks. The second addresses language modeling: word-level on WikiText-103 (Merity et al., 2017) and character-level on enwik8 (Mahoney, 2006). We compare the Recurrent Memory Transformer with Transformer and Transformer-XL models. (A segment-recurrence sketch follows the table.) |
| Researcher Affiliation | Academia | Aydar Bulatov (1), bulatov.as@phystech.edu; Yuri Kuratov (1,2), yurii.kuratov@phystech.edu; Mikhail S. Burtsev (1,2), burtcev.ms@mipt.ru. (1) Neural Networks and Deep Learning Lab, Moscow Institute of Physics and Technology, Dolgoprudny, Russia; (2) AIRI, Moscow, Russia |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | RMT code and experiments are available at https://github.com/booydar/LM-RMT. The code, raw experiment results, and hyperparameters are provided in the supplementary materials and on GitHub. |
| Open Datasets | Yes | We use two standard benchmarks for language modeling: WikiText-103 and enwik8. WikiText-103 (Merity et al., 2017) is used for word-level language modeling and contains 103M words from English Wikipedia articles. Enwik8 (Mahoney, 2006) is used for character-level language modeling and consists of the first 10^8 bytes of an XML text dump of English Wikipedia. (A hedged data-loading sketch follows the table.) |
| Dataset Splits | Yes | Train/valid/test split as in Beltagy et al. (2020); the metric is F1. |
| Hardware Specification | Yes | We used different GPUs depending on the task: 1080Ti, V100, A100. We provide this information in Appendix A for each task. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and Hugging Face Transformers but does not provide version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Language modeling experiments follow the same model and training hyperparameters as Transformer-XL. WikiText-103 experiments use 16-layer Transformers (10 heads, hidden size 410, feed-forward size 2100); enwik8 experiments use 12-layer Transformers (8 heads, hidden size 512, feed-forward size 2048). We used the Adam optimizer (Kingma and Ba, 2015) with a linearly scheduled learning rate starting from 0.00025, trained for 200,000 steps on WikiText-103 and 400,000 steps on enwik8. (An optimizer/schedule sketch follows the table.) |
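
The mechanism these experiments probe, segment-level recurrence through memory tokens, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the class name `RecurrentMemorySketch`, the backbone interface, and the choice to prepend read-memory and append write-memory tokens to every segment are illustrative assumptions.

```python
import torch
from torch import nn

class RecurrentMemorySketch(nn.Module):
    """Hedged sketch of segment-level recurrence with memory tokens."""

    def __init__(self, backbone: nn.Module, hidden_size: int, num_mem: int = 10):
        super().__init__()
        self.backbone = backbone  # any module mapping [B, T, H] -> [B, T, H]
        self.memory = nn.Parameter(torch.randn(num_mem, hidden_size) * 0.02)
        self.num_mem = num_mem

    def forward(self, segment_embeds):
        """segment_embeds: list of [batch, seg_len, hidden] tensors, one per segment."""
        batch = segment_embeds[0].size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)  # initial memory state
        outputs = []
        for seg in segment_embeds:
            # Surround the segment with memory tokens: read memory in front,
            # write memory at the end (an assumption about placement).
            x = torch.cat([mem, seg, mem], dim=1)
            h = self.backbone(x)
            # Hidden states of the trailing memory tokens carry information
            # to the next segment, giving recurrence across segments.
            mem = h[:, -self.num_mem:, :]
            outputs.append(h[:, self.num_mem:-self.num_mem, :])
        return outputs

# Usage with a small Transformer encoder as the backbone:
layer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
rmt = RecurrentMemorySketch(nn.TransformerEncoder(layer, num_layers=2), hidden_size=128)
segments = [torch.randn(4, 64, 128) for _ in range(3)]  # a long input split into 3 segments
outputs = rmt(segments)
```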
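
Both benchmarks are public. The paper does not prescribe a download path, so the following is only a convenience sketch: the `wikitext`/`wikitext-103-v1` identifiers refer to the standard Hugging Face Hub dataset, and enwik8 is distributed as a zip of the first 10^8 bytes of an English Wikipedia XML dump.

```python
# Hedged convenience sketch, not the authors' data pipeline.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-v1")  # word-level LM, ~103M words
print({split: len(wikitext[split]) for split in wikitext})  # train / validation / test

# enwik8 (character-level LM) can be fetched directly, e.g.:
#   curl -O http://mattmahoney.net/dc/enwik8.zip && unzip enwik8.zip
```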
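
The reported optimization recipe translates into a short PyTorch sketch. The helper name `make_optimizer_and_schedule` is hypothetical, and warmup is not specified in the excerpt above, so the schedule below decays linearly from step 0 as an assumption.

```python
import torch

def make_optimizer_and_schedule(model: torch.nn.Module, total_steps: int = 200_000):
    # Adam with lr 2.5e-4, decayed linearly to zero over total_steps
    # (200k for WikiText-103, 400k for enwik8, as reported).
    optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps)
    )
    return optimizer, scheduler
```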