Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns
Authors: Brian DuSell, David Chiang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that transformer language models with nondeterministic stack attention learn CFLs very effectively, consistently outperforming baseline transformers, and outperforming even the previously-proposed VRNS-RNN (DuSell & Chiang, 2022) on a CFL with maximal parsing difficulty (Greibach, 1973). |
| Researcher Affiliation | Academia | Brian DuSell, Department of Computer Science, ETH Zürich, brian.dusell@inf.ethz.ch; David Chiang, Department of Computer Science and Engineering, University of Notre Dame, dchiang@nd.edu |
| Pseudocode | Yes | Although Eq. (12) sums over an exponential number of runs, it can be computed in cubic time and quadratic space using Lang's dynamic programming algorithm (Lang, 1974), which can be expressed using the abstraction of Eq. (4). See Appendix B for details. ... Appendix B, IMPLEMENTATION DETAILS OF THE DVPDA: We maintain three main data structures: ... We initialize the tensors as follows. ... For 1 ≤ t ≤ n and 1 ≤ i ≤ t − 1, ... (an illustrative sketch of this cubic-time pattern follows the table) |
| Open Source Code | Yes | Our code is publicly available. https://github.com/bdusell/stack-attention |
| Open Datasets | Yes | We use the same sampling procedure as DuSell & Chiang (2020) to generate datasets for each task. ... We use the Penn Treebank (Marcus et al., 1994) as preprocessed by Dyer et al. (2016). ... We use a subset of the German-English dataset from Europarl v7 (Koehn, 2005)... |
| Dataset Splits | Yes | Every time we train a model, we randomly sample a training set of 10k examples and a validation set of 1k examples, both with lengths in the range [40, 80]. ... All results are the best of 20 random restarts, selected by validation cross-entropy. ... All results are the best of 5 runs, selected by decoder cross-entropy on the validation set. |
| Hardware Specification | Yes | Table 6: Computational cost of training each architecture on the PTB language modeling task when run on an NVIDIA TITAN Xp GPU. |
| Software Dependencies | No | The paper mentions software like PyTorch and SentencePiece but does not specify their version numbers within the text. For example, "We use PyTorch's (Paszke et al., 2019) LSTM implementation" and "using the SentencePiece (Kudo & Richardson, 2018) implementation of BPE (Sennrich et al., 2016)". |
| Experiment Setup | Yes | All transformers have 5 layers. For transformers with stack attention, in the third (middle) layer, we replace the SDPA sublayer with the corresponding stack attention sublayer. ... SDPA layers are causally masked and have 4 heads. For Tf, we use d_model = 32. For Tf+Sup, we use d_model = m = 32. For Tf+Nd, we use d_model = 28, m = 5... We use a dropout rate of 0.1 throughout the transformer... We use minibatches of size 10... We optimize parameters with Adam... randomly sample the initial learning rate from a log-uniform distribution over [5 × 10⁻⁴, 1 × 10⁻²]. We clip gradients with a threshold of 5 using L2 norm rescaling. (a hedged configuration sketch follows the table) |
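
The Pseudocode row above quotes the paper's claim that the sum over an exponential number of runs can be computed in cubic time and quadratic space via Lang's algorithm. The snippet below is only a generic illustration of that complexity pattern: a span-based, inside-style dynamic program over an assumed `span_weight` table, not the paper's actual DVPDA tensors or update rules.

```python
import numpy as np

def inside_weights(span_weight):
    """Toy cubic-time, quadratic-space DP over spans of a length-n sequence.

    span_weight is an assumed (n+1) x (n+1) array where span_weight[i, j]
    scores treating positions i..j-1 as a single unit. inside[i, j] sums,
    over all binary bracketings of the span, the products of their unit
    scores. Same O(n^3) time / O(n^2) space shape as Lang's algorithm,
    but not the paper's DVPDA computation.
    """
    n = span_weight.shape[0] - 1
    inside = np.zeros((n + 1, n + 1))
    for i in range(n):                      # length-1 spans
        inside[i, i + 1] = span_weight[i, i + 1]
    for length in range(2, n + 1):          # longer spans, shortest first
        for i in range(n - length + 1):
            j = i + length
            total = span_weight[i, j]       # the span as one unbroken unit
            for k in range(i + 1, j):       # or split at every midpoint k
                total += inside[i, k] * inside[k, j]
            inside[i, j] = total
    return inside

# Example: uniform weights over a length-5 sequence.
table = inside_weights(np.ones((6, 6)))
print(table[0, 5])  # total weight over all bracketings of the full span
```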
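
The Experiment Setup row above lists concrete training hyperparameters. The sketch below, assuming PyTorch, shows how those quoted settings (Adam, an initial learning rate sampled log-uniformly from [5 × 10⁻⁴, 1 × 10⁻²], minibatches of size 10, and gradient clipping at threshold 5 with L2-norm rescaling) might be wired together. `PlaceholderTransformerLM`, the vocabulary size, and the feedforward width are stand-ins for illustration, not the authors' implementation, which replaces the middle layer's SDPA sublayer with stack attention.

```python
import math
import random

import torch
import torch.nn as nn

class PlaceholderTransformerLM(nn.Module):
    """Stand-in for the paper's 5-layer transformer; the real model swaps
    the middle layer's SDPA sublayer for a stack attention sublayer."""
    def __init__(self, vocab_size, d_model=32, num_layers=5, num_heads=4, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, num_heads, dim_feedforward=4 * d_model,  # feedforward width assumed
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Causal mask: each position attends only to itself and earlier positions.
        t = x.size(1)
        mask = torch.triu(torch.full((t, t), float('-inf')), diagonal=1)
        return self.out(self.encoder(self.embed(x), mask=mask))

# Initial learning rate sampled log-uniformly from [5e-4, 1e-2], as quoted.
lr = math.exp(random.uniform(math.log(5e-4), math.log(1e-2)))

model = PlaceholderTransformerLM(vocab_size=1000)  # vocabulary size assumed
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
loss_fn = nn.CrossEntropyLoss()

# One illustrative step on a random minibatch of size 10.
batch = torch.randint(0, 1000, (10, 41))
logits = model(batch[:, :-1])
loss = loss_fn(logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
# Gradient clipping with a threshold of 5 using L2-norm rescaling, as quoted.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```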