Structured Attention Networks

Authors: Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference.
Researcher Affiliation | Academia | Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush; {yoonkim@seas,carldenton@college,lhoang@g,srush@seas}.harvard.edu; School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
Pseudocode | Yes (see the forward-backward sketch below the table) | procedure FORWARDBACKWARD(θ) and procedure BACKPROPFORWARDBACKWARD(θ, p, ∇_p L) in Figure 2; procedure INSIDEOUTSIDE(θ) in Figure 6; procedure BACKPROPINSIDEOUTSIDE(θ, p, ∇_p L) in Figure 7.
Open Source Code | Yes | All code is available at http://github.com/harvardnlp/struct-attn.
Open Datasets | Yes | The data comes from the Workshop on Asian Translation (WAT) (Nakazawa et al., 2016). We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation.
Dataset Splits | Yes (see the filtering sketch below the table) | We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions tools like the KyTea toolkit (Neubig et al., 2011), Moses, TensorFlow models/syntaxnet, and GloVe embeddings (Pennington et al., 2014) but does not provide specific version numbers for these software dependencies or other libraries/packages.
Experiment Setup | Yes (see the training-schedule sketch below the table) | Additional training details include: batch size of 20; training for 13 epochs with a learning rate of 1.0, which starts decaying by half after epoch 9 (or the epoch at which performance does not improve on validation, whichever comes first); parameter initialization over a uniform distribution U[-0.1, 0.1]; gradient normalization at 1 (i.e. renormalize the gradients to have norm 1 if the ℓ2 norm exceeds 1). Decoding is done with beam search (beam size = 5).
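
The Pseudocode row refers to the forward-backward and inside-outside procedures in Figures 2, 6, and 7 of the paper. As a reading aid, below is a minimal NumPy sketch of the forward-backward recursions for a linear-chain CRF attention layer, computing the node marginals that play the role of the attention distribution. It is not the authors' released implementation; the emission/transition parameterization, the function names, and the tiny usage check are illustrative assumptions, and the paper additionally derives an explicit, numerically stable backward pass through these recursions (BACKPROPFORWARDBACKWARD) rather than leaving that to a framework.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def forward_backward(emit, trans):
    """emit: (n, C) unary log-potentials; trans: (C, C) pairwise log-potentials.
    Returns node marginals p(z_i = c | x) of shape (n, C) and log Z."""
    n, C = emit.shape
    alpha = np.zeros((n, C))  # alpha[i, c]: log-sum over prefixes ending in state c at position i
    beta = np.zeros((n, C))   # beta[i, c]: log-sum over suffixes that follow state c at position i

    alpha[0] = emit[0]
    for i in range(1, n):
        alpha[i] = emit[i] + logsumexp(alpha[i - 1][:, None] + trans, axis=0)

    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(trans + (emit[i + 1] + beta[i + 1])[None, :], axis=1)

    log_Z = logsumexp(alpha[-1], axis=0)
    p = np.exp(alpha + beta - log_Z)  # each row of p sums to 1
    return p, log_Z

# Tiny usage check on random potentials (illustrative only).
p, log_Z = forward_backward(np.random.randn(6, 3), np.random.randn(3, 3))
assert np.allclose(p.sum(axis=1), 1.0)
```

Keeping every quantity in log space mirrors the numerical-stability concern that motivates the paper's explicit log-space backward pass.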
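
The Open Datasets and Dataset Splits rows describe the WAT data preparation: keep pairs where the Japanese side has at most 50 characters and the English side at most 50 words, then randomly pick 500K training pairs. The sketch below assumes plain-text, line-aligned parallel files; the file handling, sampling seed, and whether the character count is taken before or after segmentation are assumptions, not details from the paper.

```python
import random

def filter_and_sample(ja_path, en_path, n_samples=500_000,
                      max_ja_chars=50, max_en_words=50, seed=0):
    """Length-filter a line-aligned parallel corpus, then randomly sample n_samples pairs."""
    pairs = []
    with open(ja_path, encoding="utf-8") as f_ja, open(en_path, encoding="utf-8") as f_en:
        for ja, en in zip(f_ja, f_en):
            ja, en = ja.rstrip("\n"), en.rstrip("\n")
            if len(ja) <= max_ja_chars and len(en.split()) <= max_en_words:
                pairs.append((ja, en))
    random.Random(seed).shuffle(pairs)
    return pairs[:n_samples]
```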
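
The Experiment Setup row quotes the optimization details: gradient renormalization to norm 1, a halving learning-rate schedule, and uniform initialization over [-0.1, 0.1]. The sketch below is one plausible NumPy rendering of those three pieces under stated assumptions; the function names, the loop structure, and the validation-based decay trigger are illustrative, not the authors' code.

```python
import numpy as np

def renormalize_gradients(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

def next_learning_rate(lr, epoch, val_history, decay_start=9, decay=0.5):
    """Halve the learning rate once past `decay_start` epochs, or as soon as
    validation performance (lower is better) stops improving, whichever comes first."""
    stalled = len(val_history) >= 2 and val_history[-1] >= val_history[-2]
    return lr * decay if (epoch >= decay_start or stalled) else lr

def init_uniform(shape, scale=0.1, rng=None):
    """Initialize parameters uniformly over [-scale, scale]."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-scale, scale, size=shape)

# Example schedule: 13 epochs, base learning rate 1.0, decay kicking in after epoch 9.
lr, val_history = 1.0, []
for epoch in range(1, 14):
    val_history.append(100.0 - epoch)  # placeholder validation scores (always improving)
    lr = next_learning_rate(lr, epoch, val_history)
```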