Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Authors: Brian DuSell, David Chiang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that transformer language models with nondeterministic stack attention learn CFLs very effectively, consistently outperforming baseline transformers, and outperforming even the previously-proposed VRNS-RNN (DuSell & Chiang, 2022) on a CFL with maximal parsing difficulty (Greibach, 1973).
Researcher Affiliation | Academia | Brian DuSell, Department of Computer Science, ETH Zürich, brian.dusell@inf.ethz.ch; David Chiang, Department of Computer Science and Engineering, University of Notre Dame, dchiang@nd.edu
Pseudocode | Yes | Although Eq. (12) sums over an exponential number of runs, it can be computed in cubic time and quadratic space using Lang's dynamic programming algorithm (Lang, 1974), which can be expressed using the abstraction of Eq. (4). See Appendix B for details. ... Appendix B, IMPLEMENTATION DETAILS OF THE DVPDA: We maintain three main data structures: ... We initialize the tensors as follows. ... For 1 ≤ t ≤ n and 1 ≤ i ≤ t − 1, ... (An illustrative dynamic-programming sketch follows this table.)
Open Source Code | Yes | Our code is publicly available: https://github.com/bdusell/stack-attention
Open Datasets | Yes | We use the same sampling procedure as DuSell & Chiang (2020) to generate datasets for each task. ... We use the Penn Treebank (Marcus et al., 1994) as preprocessed by Dyer et al. (2016). ... We use a subset of the German-English dataset from Europarl v7 (Koehn, 2005)...
Dataset Splits | Yes | Every time we train a model, we randomly sample a training set of 10k examples and a validation set of 1k examples, both with lengths in the range [40, 80]. ... All results are the best of 20 random restarts, selected by validation cross-entropy. ... All results are the best of 5 runs, selected by decoder cross-entropy on the validation set. (See the restart-selection sketch after this table.)
Hardware Specification | Yes | Table 6: Computational cost of training each architecture on the PTB language modeling task when run on an NVIDIA TITAN Xp GPU.
Software Dependencies | No | The paper mentions software such as PyTorch and SentencePiece but does not specify version numbers. For example: "We use PyTorch's (Paszke et al., 2019) LSTM implementation" and "using the SentencePiece (Kudo & Richardson, 2018) implementation of BPE (Sennrich et al., 2016)".
Experiment Setup | Yes | All transformers have 5 layers. For transformers with stack attention, in the third (middle) layer, we replace the SDPA sublayer with the corresponding stack attention sublayer. ... SDPA layers are causally masked and have 4 heads. For Tf, we use d_model = 32. For Tf+Sup, we use d_model = m = 32. For Tf+Nd, we use d_model = 28, m = 5... We use a dropout rate of 0.1 throughout the transformer... We use minibatches of size 10... We optimize parameters with Adam... randomly sample the initial learning rate from a log-uniform distribution over [5 × 10⁻⁴, 1 × 10⁻²]. We clip gradients with a threshold of 5 using L2 norm rescaling. (A sketch of these optimization settings follows this table.)
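
The Pseudocode row cites Lang's (1974) dynamic program, which collapses a sum over exponentially many PDA runs into cubic time and quadratic space; the paper's own DVPDA tensors are detailed in its Appendix B. As a hedged analogy only, not the authors' implementation, the sketch below shows the same principle on a simpler object: the inside algorithm sums the weights of exponentially many parses under a CNF grammar in O(n³) time with an O(n²) chart. All function and grammar names here are hypothetical.

```python
# Analogy to Lang's dynamic program (NOT the paper's DVPDA code): sum the
# weights of exponentially many parse trees in O(n^3) time, O(n^2) space.
from collections import defaultdict

def inside_total_weight(tokens, lexical, binary, start="S"):
    """Sum the weights of all parses of `tokens` under a CNF grammar.

    lexical: dict mapping (nonterminal, terminal) -> weight
    binary:  dict mapping (parent, left, right)   -> weight
    """
    n = len(tokens)
    # chart[i][j][A] = total weight of all derivations of tokens[i:j] from A
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]

    # Width-1 spans: lexical rules A -> token.
    for i, tok in enumerate(tokens):
        for (A, t), w in lexical.items():
            if t == tok:
                chart[i][i + 1][A] += w

    # Wider spans: combine two adjacent sub-spans with binary rules A -> B C.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (A, B, C), w in binary.items():
                    left = chart[i][k].get(B, 0.0)
                    right = chart[k][j].get(C, 0.0)
                    if left and right:
                        chart[i][j][A] += w * left * right

    return chart[0][n].get(start, 0.0)

# Toy grammar: S -> A B | S S, A -> a, B -> b.
lexical = {("A", "a"): 1.0, ("B", "b"): 1.0}
binary = {("S", "A", "B"): 0.5, ("S", "S", "S"): 0.5}
print(inside_total_weight(["a", "b", "a", "b"], lexical, binary))
```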
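
The Dataset Splits row describes a best-of-N protocol: train from several random restarts and keep the model with the lowest validation cross-entropy. A minimal sketch of that selection loop, where `build_model`, `train_one_model`, and `validation_cross_entropy` are hypothetical stand-ins for the paper's actual pipeline:

```python
# Sketch of "best of N random restarts, selected by validation cross-entropy".
import math
import random

def best_of_restarts(build_model, train_one_model, validation_cross_entropy,
                     num_restarts=20, seed=0):
    rng = random.Random(seed)
    best_model, best_ce = None, math.inf
    for _ in range(num_restarts):
        model = build_model(seed=rng.randrange(2**31))  # fresh random init
        train_one_model(model)                          # full training run
        ce = validation_cross_entropy(model)            # selection criterion
        if ce < best_ce:
            best_model, best_ce = model, ce
    return best_model, best_ce
```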
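
The Experiment Setup row lists the optimization choices: Adam, an initial learning rate sampled log-uniformly from [5 × 10⁻⁴, 1 × 10⁻²], minibatches of size 10, and gradient clipping at L2 norm 5. A minimal sketch of those settings, assuming PyTorch (which the paper mentions) and a trivial stand-in for the actual 5-layer transformer:

```python
# Sketch of the quoted optimization settings; `model` and `loss_fn` below are
# placeholders, not the paper's stack-attention transformer.
import math
import random

import torch

def sample_log_uniform(low=5e-4, high=1e-2, rng=random):
    # Log-uniform sampling: uniform in log space, then exponentiate.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def training_step(model, batch, optimizer, loss_fn, clip_norm=5.0):
    optimizer.zero_grad()
    logits = model(batch["inputs"])
    loss = loss_fn(logits, batch["targets"])
    loss.backward()
    # L2 norm rescaling: gradients are scaled down if their total norm exceeds 5.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
    return loss.item()

# Usage with a trivial stand-in model and a minibatch of size 10.
model = torch.nn.Linear(8, 4)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=sample_log_uniform())
batch = {"inputs": torch.randn(10, 8), "targets": torch.randint(0, 4, (10,))}
print(training_step(model, batch, optimizer, loss_fn))
```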