Staircase Attention for Recurrent Processing of Sequences
Authors: Da JU, Stephen Roller, Sainbayar Sukhbaatar, Jason E Weston
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we introduce a novel attention procedure called staircase attention that, unlike self-attention, operates across the sequence (in time) recurrently processing the input by adding another step of processing... Staircase attention is shown to be able to solve tasks that involve tracking that conventional Transformers cannot, due to this recurrence. Further, it is shown to provide improved modeling power for the same size model (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains. We show on two tasks requiring state-tracking that Staircase models can perform successfully where Transformers fail. We then show on two language modeling and a dialogue modeling task for the same number of parameters, significantly lower perplexities can be obtained compared to standard Transformers for certain kinds of Staircase models. (A minimal code sketch of this recurrent, chunked processing is given after the table.) |
| Researcher Affiliation | Industry | Da Ju (Meta AI, daju@fb.com); Stephen Roller (Meta AI, roller@fb.com); Sainbayar Sukhbaatar (Meta AI, sainbar@fb.com); Jason Weston (Meta AI, jase@meta.com) |
| Pseudocode | No | The paper describes its model and processes using text and diagrams (Figure 1), but does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be made publicly available on GitHub: https://github.com/facebookresearch/transformer-sequential/ |
| Open Datasets | Yes | Enwik8: Enwik8 is a character-level language modeling task [26] that consists of 100M tokens taken from Wikipedia articles. ... Pushshift.io Reddit: We use a variant of Reddit discussions... extracted and obtained by a third party and made available on pushshift.io [3]... BASE Data: We use the language modeling dataset from Lewis et al. [23], which consists of approximately 100B tokens, combining the corpora used in Liu et al. [24] that consists of Wikipedia, Book Corpus, CC-News, OpenWebText and Stories, along with the English subset of the CC100 corpus [8]. |
| Dataset Splits | No | For Random Walk and Algorithm tasks, we trained each model with multiple seeds and chose the best seed as measured by their validation performance. Validation metrics such as 'Valid (error %)', 'Valid (bpb)', and 'Valid (ppl)' appear in the headers of Figure 2 and Table 2. The paper mentions the use of a validation set and reports performance on it, but it does not specify the exact split percentages or sample counts for the validation sets of any of the datasets used. |
| Hardware Specification | No | The paper does not specify the hardware used for experiments, such as particular GPU or CPU models. It only mentions 'modern hardware' in a general sense. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | No | See Appendix A for further details of our experimental setup for training, including all hyperparameter choices. The paper states that detailed experimental setup information, including hyperparameters, is available in Appendix A, but this information is not present in the main text of the paper. |
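
To make the mechanism quoted in the Research Type row more concrete, the following is a minimal, hypothetical sketch of staircase-style recurrent chunked processing, not the authors' released implementation. The class name `StaircaseSketch`, the defaults `chunk_size=64` and `n_steps=4`, and the use of `nn.TransformerEncoder` as the shared core are illustrative assumptions; causal masking and the paper's exact forwarding scheme are omitted for brevity.

```python
# Hypothetical sketch of staircase-style recurrent chunked processing.
# Not the authors' implementation; names and defaults are assumptions.
import torch
import torch.nn as nn


class StaircaseSketch(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4,
                 chunk_size=64, n_steps=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # The same layer stack is reused at every recurrent step, so extra
        # steps add computation without adding parameters.
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.chunk_size = chunk_size
        self.n_steps = n_steps

    def forward(self, tokens):
        # tokens: (batch, seq_len) token ids
        chunks = self.embed(tokens).split(self.chunk_size, dim=1)
        states = []   # hidden states of chunks still inside the staircase
        outputs = []  # representation of each chunk after n_steps of processing
        for chunk in chunks:
            # A new chunk enters the staircase; older chunks are re-processed
            # together with it, giving each of them one more step of computation.
            states.append(chunk)
            sizes = [s.size(1) for s in states]
            joint = self.core(torch.cat(states, dim=1))
            states = list(joint.split(sizes, dim=1))
            # The oldest chunk leaves once it has been processed n_steps times.
            if len(states) == self.n_steps:
                outputs.append(states.pop(0))
        outputs.extend(states)  # flush chunks still in flight (simplification)
        return torch.cat(outputs, dim=1)  # (batch, seq_len, d_model)
```

For example, `StaircaseSketch(vocab_size=256)(torch.randint(0, 256, (2, 512)))` returns a `(2, 512, 256)` tensor of chunk representations. Because the shared layer stack is reapplied as chunks move up the staircase, recurrent depth grows with time while the parameter count stays fixed, which is the property the quoted abstract emphasizes.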