Staircase Attention for Recurrent Processing of Sequences
Authors: Da JU, Stephen Roller, Sainbayar Sukhbaatar, Jason E Weston
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we introduce a novel attention procedure called staircase attention that, unlike self-attention, operates across the sequence (in time) recurrently processing the input by adding another step of processing... Staircase attention is shown to be able to solve tasks that involve tracking that conventional Transformers cannot, due to this recurrence. Further, it is shown to provide improved modeling power for the same size model (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains. We show on two tasks requiring state-tracking that Staircase models can perform successfully where Transformers fail. We then show on two language modeling and a dialogue modeling task for the same number of parameters, significantly lower perplexities can be obtained compared to standard Transformers for certain kinds of Staircase models. (A minimal code sketch of this recurrent, chunked processing is given after the table.) |
| Researcher Affiliation | Industry | Da Ju (Meta AI, daju@fb.com); Stephen Roller (Meta AI, roller@fb.com); Sainbayar Sukhbaatar (Meta AI, sainbar@fb.com); Jason Weston (Meta AI, jase@meta.com) |
| Pseudocode | No | The paper describes its model and processes using text and diagrams (Figure 1), but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be made publicly available on GitHub: https://github.com/facebookresearch/transformer-sequential/ |
| Open Datasets | Yes | Enwik8: Enwik8 is a character-level language modeling task [26] that consists of 100M tokens taken from Wikipedia articles. ... Pushshift.io Reddit: We use a variant of Reddit discussions... extracted and obtained by a third party and made available on pushshift.io [3]... BASE Data: We use the language modeling dataset from Lewis et al. [23], which consists of approximately 100B tokens, combining the corpora used in Liu et al. [24] that consists of Wikipedia, Book Corpus, CC-News, OpenWebText and Stories, along with the English subset of the CC100 corpus [8]. |
| Dataset Splits | No | For Random Walk and Algorithm tasks, we trained each model with multiple seeds and chose the best seed as measured by their validation performance. Validation metrics such as 'Valid (error %)', 'Valid (bpb)', and 'Valid (ppl)' appear in the headers of Figure 2 and Table 2. The paper mentions the use of a validation set and reports performance on it, but it does not specify the exact split percentages or sample counts for the validation sets of any of the datasets used. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as particular GPU or CPU models. It only mentions 'modern hardware' in a general sense. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | No | See Appendix A for further details of our experimental setup for training, including all hyperparameter choices. The paper states that detailed experimental setup information, including hyperparameters, is available in Appendix A, but this information is not present in the main text of the paper. |
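
To make the mechanism quoted in the Research Type row more concrete, the following is a minimal, hypothetical sketch of staircase-style recurrent chunked processing, not the authors' released implementation. The class name `StaircaseSketch`, the defaults `chunk_size=64` and `n_steps=4`, and the use of `nn.TransformerEncoder` as the shared core are illustrative assumptions; causal masking and the paper's exact forwarding scheme are omitted for brevity.

```python
# Hypothetical sketch of staircase-style recurrent chunked processing.
# Not the authors' implementation; names and defaults are assumptions.
import torch
import torch.nn as nn


class StaircaseSketch(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4,
                 chunk_size=64, n_steps=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # The same layer stack is reused at every recurrent step, so extra
        # steps add computation without adding parameters.
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.chunk_size = chunk_size
        self.n_steps = n_steps

    def forward(self, tokens):
        # tokens: (batch, seq_len) token ids
        chunks = self.embed(tokens).split(self.chunk_size, dim=1)
        states = []   # hidden states of chunks still inside the staircase
        outputs = []  # representation of each chunk after n_steps of processing
        for chunk in chunks:
            # A new chunk enters the staircase; older chunks are re-processed
            # together with it, giving each of them one more step of computation.
            states.append(chunk)
            sizes = [s.size(1) for s in states]
            joint = self.core(torch.cat(states, dim=1))
            states = list(joint.split(sizes, dim=1))
            # The oldest chunk leaves once it has been processed n_steps times.
            if len(states) == self.n_steps:
                outputs.append(states.pop(0))
        outputs.extend(states)  # flush chunks still in flight (simplification)
        return torch.cat(outputs, dim=1)  # (batch, seq_len, d_model)
```

For example, `StaircaseSketch(vocab_size=256)(torch.randint(0, 256, (2, 512)))` returns a `(2, 512, 256)` tensor of chunk representations. Because the shared layer stack is reapplied as chunks move up the staircase, recurrent depth grows with time while the parameter count stays fixed, which is the property the quoted abstract emphasizes.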