Structured Attention Networks

Authors: Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference.
Researcher Affiliation | Academia | Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush; {yoonkim@seas,carldenton@college,lhoang@g,srush@seas}.harvard.edu; School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
Pseudocode | Yes (see the forward-backward sketch below the table) | procedure FORWARDBACKWARD(θ) and procedure BACKPROPFORWARDBACKWARD(θ, p, ∇_p L) in Figure 2; procedure INSIDEOUTSIDE(θ) in Figure 6; procedure BACKPROPINSIDEOUTSIDE(θ, p, ∇_p L) in Figure 7.
Open Source Code | Yes | All code is available at http://github.com/harvardnlp/struct-attn.
Open Datasets | Yes | The data comes from the Workshop on Asian Translation (WAT) (Nakazawa et al., 2016). We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation.
Dataset Splits | Yes (see the filtering sketch below the table) | We randomly pick 500K sentences from the original training set (of 3M sentences) where the Japanese sentence was at most 50 characters and the English sentence was at most 50 words. We apply the same length filter on the provided validation/test sets for evaluation.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions tools like the KyTea toolkit (Neubig et al., 2011), Moses, TensorFlow models/syntaxnet, and GloVe embeddings (Pennington et al., 2014) but does not provide specific version numbers for these software dependencies or other libraries/packages.
Experiment Setup | Yes (see the training-schedule sketch below the table) | Additional training details include: batch size of 20; training for 13 epochs with a learning rate of 1.0, which starts decaying by half after epoch 9 (or the epoch at which performance does not improve on validation, whichever comes first); parameter initialization over a uniform distribution U[-0.1, 0.1]; gradient normalization at 1 (i.e. renormalize the gradients to have norm 1 if the ℓ2 norm exceeds 1). Decoding is done with beam search (beam size = 5).
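
The Pseudocode row refers to the forward-backward and inside-outside procedures in Figures 2, 6, and 7 of the paper. As a reading aid, below is a minimal NumPy sketch of the forward-backward recursions for a linear-chain CRF attention layer, computing the node marginals that play the role of the attention distribution. It is not the authors' released implementation; the emission/transition parameterization, the function names, and the tiny usage check are illustrative assumptions, and the paper additionally derives an explicit, numerically stable backward pass through these recursions (BACKPROPFORWARDBACKWARD) rather than leaving that to a framework.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def forward_backward(emit, trans):
    """emit: (n, C) unary log-potentials; trans: (C, C) pairwise log-potentials.
    Returns node marginals p(z_i = c | x) of shape (n, C) and log Z."""
    n, C = emit.shape
    alpha = np.zeros((n, C))  # alpha[i, c]: log-sum over prefixes ending in state c at position i
    beta = np.zeros((n, C))   # beta[i, c]: log-sum over suffixes that follow state c at position i

    alpha[0] = emit[0]
    for i in range(1, n):
        alpha[i] = emit[i] + logsumexp(alpha[i - 1][:, None] + trans, axis=0)

    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(trans + (emit[i + 1] + beta[i + 1])[None, :], axis=1)

    log_Z = logsumexp(alpha[-1], axis=0)
    p = np.exp(alpha + beta - log_Z)  # each row of p sums to 1
    return p, log_Z

# Tiny usage check on random potentials (illustrative only).
p, log_Z = forward_backward(np.random.randn(6, 3), np.random.randn(3, 3))
assert np.allclose(p.sum(axis=1), 1.0)
```

Keeping every quantity in log space mirrors the numerical-stability concern that motivates the paper's explicit log-space backward pass.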
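
The Open Datasets and Dataset Splits rows describe the WAT data preparation: keep pairs where the Japanese side has at most 50 characters and the English side at most 50 words, then randomly pick 500K training pairs. The sketch below assumes plain-text, line-aligned parallel files; the file handling, sampling seed, and whether the character count is taken before or after segmentation are assumptions, not details from the paper.

```python
import random

def filter_and_sample(ja_path, en_path, n_samples=500_000,
                      max_ja_chars=50, max_en_words=50, seed=0):
    """Length-filter a line-aligned parallel corpus, then randomly sample n_samples pairs."""
    pairs = []
    with open(ja_path, encoding="utf-8") as f_ja, open(en_path, encoding="utf-8") as f_en:
        for ja, en in zip(f_ja, f_en):
            ja, en = ja.rstrip("\n"), en.rstrip("\n")
            if len(ja) <= max_ja_chars and len(en.split()) <= max_en_words:
                pairs.append((ja, en))
    random.Random(seed).shuffle(pairs)
    return pairs[:n_samples]
```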
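
The Experiment Setup row quotes the optimization details: gradient renormalization to norm 1, a halving learning-rate schedule, and uniform initialization over [-0.1, 0.1]. The sketch below is one plausible NumPy rendering of those three pieces under stated assumptions; the function names, the loop structure, and the validation-based decay trigger are illustrative, not the authors' code.

```python
import numpy as np

def renormalize_gradients(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

def next_learning_rate(lr, epoch, val_history, decay_start=9, decay=0.5):
    """Halve the learning rate once past `decay_start` epochs, or as soon as
    validation performance (lower is better) stops improving, whichever comes first."""
    stalled = len(val_history) >= 2 and val_history[-1] >= val_history[-2]
    return lr * decay if (epoch >= decay_start or stalled) else lr

def init_uniform(shape, scale=0.1, rng=None):
    """Initialize parameters uniformly over [-scale, scale]."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-scale, scale, size=shape)

# Example schedule: 13 epochs, base learning rate 1.0, decay kicking in after epoch 9.
lr, val_history = 1.0, []
for epoch in range(1, 14):
    val_history.append(100.0 - epoch)  # placeholder validation scores (always improving)
    lr = next_learning_rate(lr, epoch, val_history)
```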