Block-Recurrent Transformers
Authors: DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our model out-performs a long-range Transformer XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG19 (books), arXiv papers, and GitHub source code. Our code has been released as open source [1]. |
| Researcher Affiliation | Collaboration | DeLesley Hutchins¹, Imanol Schlag³, Yuhuai Wu¹, Ethan Dyer², Behnam Neyshabur². ¹Google Research; ²Google Research, Blueshift Team; ³The Swiss AI Lab IDSIA, SUPSI & USI. {delesley, yuhuai, edyer, neyshabur}@google.com, imanol@idsia.ch |
| Pseudocode | No | The paper contains figures illustrating the architecture and concepts (Figure 1, Figure 2), but no pseudocode or algorithm blocks. |
| Open Source Code | No | The abstract states: 'Our code has been released as open source [1].' However, the checklist explicitly contradicts this under section 2(a): 'Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We plan to open-source the code itself, although we have not yet done so.' Prioritizing the checklist's explicit statement, the code is not yet released. |
| Open Datasets | Yes | The PG19 dataset [42] contains full-length books written prior to 1919 from Project Gutenberg. The arXiv dataset [11] is a corpus of technical papers downloaded via the arXiv Bulk Data Access. The GitHub dataset [11] is a corpus of source code from different GitHub repositories with open-source licenses. Our main results are for PG19, which is a publicly available dataset. |
| Dataset Splits | No | The paper mentions a 'fixed validation set of 1024 tokens' in Appendix C.2 and gives the test set size in Section 4.5. However, it does not state the training set size or explain how the data was partitioned into training, validation, and test splits, which is needed to reproduce the data partitioning. |
| Hardware Specification | Yes | All models were trained on Tensor Processing Units (TPUs) using TensorFlow. We use 64 TPU cores for all experiments, which gives us a total of 128GB of HBM memory. |
| Software Dependencies | No | The paper mentions 'TensorFlow', 'SentencePiece vocabulary', and 'Adafactor' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We train all models for 1.3M steps using Adafactor [51] with a constant learning rate of 0.01 and a batch size of 256 for XL-style models and 32 for Slide models. We use a fixed validation set of 1024 tokens for all models. We stabilize training by initializing the weights and bias to small but non-zero values, and adding a constant -1 and +1 to the input and forget gates to bias them to remember. |
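The gate-initialization detail quoted in the Experiment Setup row can be made concrete. The sketch below is not the authors' released code; it is a minimal TensorFlow illustration, assuming an LSTM-style gated state update, of what "small but non-zero" weight/bias initialization plus fixed offsets of -1 (input gate) and +1 (forget gate) look like. Names such as `GatedStateUpdate` and `state_dim` are illustrative, not from the paper.

```python
# Minimal sketch (assumption-laden, not the paper's implementation) of an
# LSTM-style gated update whose gates are biased to "remember" at init.
import tensorflow as tf


class GatedStateUpdate(tf.keras.layers.Layer):
    def __init__(self, state_dim: int):
        super().__init__()
        # Small but non-zero initial weights and biases, as described.
        small = tf.keras.initializers.TruncatedNormal(stddev=0.01)
        self.input_gate = tf.keras.layers.Dense(
            state_dim, kernel_initializer=small, bias_initializer=small)
        self.forget_gate = tf.keras.layers.Dense(
            state_dim, kernel_initializer=small, bias_initializer=small)

    def call(self, state, update):
        # Constant offsets bias the gates toward keeping the old state:
        # sigmoid(x - 1) starts low  -> write little of the update,
        # sigmoid(x + 1) starts high -> forget little of the state.
        i = tf.sigmoid(self.input_gate(update) - 1.0)
        f = tf.sigmoid(self.forget_gate(update) + 1.0)
        return f * state + i * update


# Usage example with hypothetical shapes (batch of 2, state_dim of 8).
layer = GatedStateUpdate(state_dim=8)
state = tf.zeros([2, 8])
update = tf.random.normal([2, 8])
new_state = layer(state, update)
```

Because the gate weights start near zero, the sigmoids are dominated by the fixed offsets early in training, so the recurrent state changes slowly until the model learns when to write; this is one plausible reading of the stabilization trick the paper describes.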