Structured Attention Networks
Authors: Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. |
| Researcher Affiliation | Academia | Yoon Kim Carl Denton Luong Hoang Alexander M. Rush {yoonkim@seas,carldenton@college,lhoang@g,srush@seas}.harvard.edu School of Engineering and Applied Sciences Harvard University Cambridge, MA 02138, USA |
| Pseudocode | Yes | procedure FORWARDBACKWARD(θ) and procedure BACKPROPFORWARDBACKWARD(θ, p, ∇_p L) in Figure 2. procedure INSIDEOUTSIDE(θ) in Figure 6. procedure BACKPROPINSIDEOUTSIDE(θ, p, ∇_p L) in Figure 7. (A minimal forward-backward sketch appears after the table.) |
| Open Source Code | Yes | All code is available at http://github.com/harvardnlp/struct-attn. |
| Open Datasets | Yes | The data comes from the Workshop on Asian Translation (WAT) (Nakazawa et al., 2016). We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation. (A length-filter sketch appears after the table.) |
| Dataset Splits | Yes | We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions tools like the KyTea toolkit (Neubig et al., 2011), Moses, TensorFlow's models/syntaxnet, and GloVe embeddings (Pennington et al., 2014) but does not provide specific version numbers for these software dependencies or other libraries/packages. |
| Experiment Setup | Yes | Additional training details include: batch size of 20; training for 13 epochs with a learning rate of 1.0, which starts decaying by half after epoch 9 (or the epoch at which performance does not improve on validation, whichever comes first); parameter initialization over a uniform distribution U[-0.1, 0.1]; gradient normalization at 1 (i.e. renormalize the gradients to have norm 1 if the l2 norm exceeds 1). Decoding is done with beam search (beam size = 5). (A sketch of these settings appears after the table.) |
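
To make the Pseudocode row above concrete, here is a minimal log-space forward-backward sketch for computing the pairwise marginals that a linear-chain CRF attention layer uses as attention weights. This is not the paper's released Torch implementation; the parameterization (a single pairwise log-potential tensor `theta` of shape `(n-1, C, C)`, with unary scores folded in) and the function names are assumptions made for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def forward_backward(theta):
    """Pairwise marginals of a linear-chain CRF via forward-backward in log space.

    theta: (n-1, C, C) array; theta[i, a, b] is the log potential for
    z_i = a, z_{i+1} = b (assumed layout; unary terms folded into the pairwise ones).
    Returns p with p[i, a, b] = p(z_i = a, z_{i+1} = b | x).
    """
    n = theta.shape[0] + 1
    C = theta.shape[1]
    alpha = np.full((n, C), -np.inf)
    beta = np.full((n, C), -np.inf)
    alpha[0] = 0.0
    beta[n - 1] = 0.0
    for i in range(1, n):                      # forward recursion
        alpha[i] = logsumexp(alpha[i - 1][:, None] + theta[i - 1], axis=0)
    for i in range(n - 2, -1, -1):             # backward recursion
        beta[i] = logsumexp(theta[i] + beta[i + 1][None, :], axis=1)
    log_Z = logsumexp(alpha[n - 1], axis=0)    # log partition function
    # marginal of each adjacent label pair, normalized by the partition function
    return np.exp(alpha[:-1, :, None] + theta + beta[1:, None, :] - log_Z)
```

The backpropagation procedures (BACKPROPFORWARDBACKWARD, BACKPROPINSIDEOUTSIDE) in the paper additionally push the loss gradient ∇_p L back through these recursions; the sketch covers only the forward computation of the marginals.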
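The Open Datasets and Dataset Splits rows describe a simple length filter on the WAT parallel data (Japanese side at most 50 characters, English side at most 50 words) followed by random subsampling of 500K training pairs. Since the paper does not provide a preprocessing script, the following is a hedged sketch under those stated assumptions; the filter-then-sample ordering and the fixed seed are illustrative choices.

```python
import random

def keep_pair(ja_sentence, en_sentence, max_ja_chars=50, max_en_words=50):
    """True if a Japanese/English pair passes the reported length filter."""
    return (len(ja_sentence) <= max_ja_chars
            and len(en_sentence.split()) <= max_en_words)

def filter_corpus(pairs):
    """Apply the length filter to an iterable of (ja, en) pairs (train/valid/test)."""
    return [(ja, en) for ja, en in pairs if keep_pair(ja, en)]

def sample_training_set(filtered_pairs, k=500_000, seed=0):
    """Randomly pick 500K filtered pairs for training (seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(filtered_pairs, k)
```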
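Finally, the Experiment Setup row maps onto a small set of hyperparameters. The sketch below restates them as plain Python; the helper names (`init_params`, `renormalize_gradients`, `learning_rate`) and the exact decay bookkeeping are not from the paper's code, only an assumption consistent with the quoted description.

```python
import numpy as np

# Hyperparameters as reported in the table.
BATCH_SIZE = 20
NUM_EPOCHS = 13
INIT_LR = 1.0
DECAY_START_EPOCH = 9
UNIFORM_INIT = 0.1   # parameters drawn from U[-0.1, 0.1]
MAX_GRAD_NORM = 1.0
BEAM_SIZE = 5

def init_params(shape, seed=0):
    """Draw a parameter tensor from U[-0.1, 0.1] (seed choice is illustrative)."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-UNIFORM_INIT, UNIFORM_INIT, size=shape)

def renormalize_gradients(grads, max_norm=MAX_GRAD_NORM):
    """Rescale gradients to have l2 norm `max_norm` if they exceed it."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

def learning_rate(epoch, epoch_val_stopped_improving):
    """Halve the rate each epoch after epoch 9, or after the epoch where
    validation performance stopped improving, whichever comes first."""
    start = min(DECAY_START_EPOCH, epoch_val_stopped_improving)
    if epoch <= start:
        return INIT_LR
    return INIT_LR * (0.5 ** (epoch - start))
```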