Block-Recurrent Transformers

Authors: DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model outperforms a long-range Transformer XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG19 (books), arXiv papers, and GitHub source code. Our code has been released as open source [1].
Researcher Affiliation | Collaboration | DeLesley Hutchins (1), Imanol Schlag (3), Yuhuai Wu (1), Ethan Dyer (2), Behnam Neyshabur (2). 1: Google Research; 2: Google Research, Blueshift Team; 3: The Swiss AI Lab IDSIA, SUPSI & USI. Contact: {delesley, yuhuai, edyer, neyshabur}@google.com, imanol@idsia.ch
Pseudocode | No | The paper contains figures illustrating the architecture and concepts (Figure 1, Figure 2), but no pseudocode or algorithm blocks (an illustrative sketch of the block-recurrent step follows this table).
Open Source Code | No | The abstract states: 'Our code has been released as open source [1].' However, the checklist explicitly contradicts this under section 2(a): 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We plan to open-source the code itself, although we have not yet done so.' Prioritizing the checklist's explicit statement, the code is not yet released.
Open Datasets | Yes | The PG19 dataset [42] contains full-length books written prior to 1919 from Project Gutenberg. The arXiv dataset [11] is a corpus of technical papers downloaded via the arXiv Bulk Data Access. The GitHub dataset [11] is a corpus of source code from different GitHub repositories with open-source licenses. Our main results are for PG19, which is a publicly available dataset (a loading sketch follows this table).
Dataset Splits | No | The paper mentions a 'fixed validation set of 1024 tokens' in Appendix C.2 and the size of the test set in Section 4.5. However, it does not provide explicit details about the training set size or how the dataset was partitioned into training, validation, and test splits, which is necessary for full reproducibility of the data partitioning.
Hardware Specification | Yes | All models were trained on Tensor Processing Units (TPUs) using TensorFlow. We use 64 TPU cores for all experiments, which gives us a total of 128GB of HBM memory.
Software Dependencies | No | The paper mentions 'TensorFlow', 'SentencePiece vocabulary', and 'Adafactor' but does not provide specific version numbers for these software components.
Experiment Setup | Yes | We train all models for 1.3M steps using Adafactor [51] with a constant learning rate of 0.01 and a batch size of 256 for XL-style models and 32 for Slide models. We use a fixed validation set of 1024 tokens for all models. We stabilize training by initializing the weights and bias to small but non-zero values, and adding a constant -1 and +1 to the input and forget gates to bias them to remember (an optimizer-setup sketch follows this table).
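
Illustrative sketch for the Pseudocode row. Since the paper ships no pseudocode, the following is only a rough sketch of the general block-recurrent idea as described in the text: process one block of tokens at a time and carry a small set of recurrent state vectors between blocks through a gated update. All names (block_recurrent_step, attend, the gate parameters) and the single-head, mask-free attention are assumptions made for illustration, not the authors' implementation.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attend(queries, keys, values):
        # Single-head scaled dot-product attention, no masking, for brevity.
        scores = queries @ keys.T / np.sqrt(queries.shape[-1])
        return softmax(scores) @ values

    def block_recurrent_step(tokens, state, w_gate, b_gate):
        # tokens: (block_len, d) embeddings of the current block
        # state:  (num_state, d) recurrent state carried between blocks
        # Tokens read from the recurrent state (cross-attention).
        token_out = tokens + attend(tokens, state, state)
        # State vectors read from the current block of tokens.
        state_update = attend(state, tokens, tokens)
        # Gated update; a positive gate bias makes the cell start out
        # "remembering" its previous state (cf. the Experiment Setup row).
        gate = 1.0 / (1.0 + np.exp(-(state @ w_gate + b_gate)))
        next_state = gate * state + (1.0 - gate) * state_update
        return token_out, next_state

    # Toy usage: slide over a long sequence one block at a time.
    rng = np.random.default_rng(0)
    d, block_len, num_state = 16, 8, 4
    seq = rng.normal(size=(4 * block_len, d))
    state = rng.normal(size=(num_state, d))
    w_gate = 0.01 * rng.normal(size=(d, d))
    b_gate = np.ones(d)
    for i in range(0, len(seq), block_len):
        _, state = block_recurrent_step(seq[i:i + block_len], state, w_gate, b_gate)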
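
Sketch for the Open Datasets and Dataset Splits rows. PG19 is distributed through TensorFlow Datasets, so the split sizes the paper leaves unstated can be inspected directly. Using the TFDS "pg19" builder is an assumption; the paper does not say how the data was loaded.

    import tensorflow_datasets as tfds

    builder = tfds.builder("pg19")
    builder.download_and_prepare()     # downloads the full corpus
    print(builder.info.splits)         # sizes of the train/validation/test splits
    val = builder.as_dataset(split="validation")
    for example in val.take(1):
        print(list(example.keys()))    # inspect the available feature keys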
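
Sketch for the Experiment Setup row. The quoted optimizer settings (Adafactor with a constant learning rate of 0.01) can be reproduced with any Adafactor implementation; the choice of Optax below is an assumption, since the paper names neither a library nor a version, and the parameters and loss are placeholders.

    import jax
    import jax.numpy as jnp
    import optax

    params = {"w": jnp.zeros((8, 8)), "b": jnp.zeros(8)}   # placeholder parameters
    optimizer = optax.adafactor(learning_rate=0.01)        # constant LR, as quoted
    opt_state = optimizer.init(params)

    def loss_fn(p, x):
        return jnp.mean((x @ p["w"] + p["b"]) ** 2)        # placeholder loss

    grads = jax.grad(loss_fn)(params, jnp.ones((4, 8)))
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)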