Deterministic Attention for Sequence-to-Sequence Constituent Parsing
Authors: Chunpeng Ma, Lemao Liu, Akihiro Tamura, Tiejun Zhao, Eiichiro Sumita
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When tested on the standard WSJ treebank, the deterministic attention model produced significant improvements over probabilistic attention models, delivering a 90.6 F-score on the test set, using ensembling but without requiring pre-training, tri-training, or POS-tagging. |
| Researcher Affiliation | Academia | (1) Machine Intelligence and Translation Laboratory, Harbin Institute of Technology, Harbin, China; (2) ASTREC, National Institute of Information and Communications Technology (NICT), Kyoto, Japan |
| Pseudocode | Yes | Table 3: Formal representation of the bottom-up linearization method. σ can be empty. For the sh action, ᵢXⱼ may also be empty, in which case the stack should be ₀X₁ after the sh action is implemented. (See the linearization sketch below the table.) |
| Open Source Code | No | We implemented the deterministic attention mechanism based on an open-source sequence-to-sequence toolkit, nematus. |
| Open Datasets | Yes | All experiments were conducted using the WSJ part of the Penn Treebank. Following previous studies such as that of Watanabe and Sumita (2015), we used Sections 2-21, 22 and 23 as the training set, development set and testing set, respectively. |
| Dataset Splits | Yes | Following previous studies such as that of Watanabe and Sumita (2015), we used Sections 2-21, 22 and 23 as the training set, development set and testing set, respectively. |
| Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., CPU, GPU, memory, or specific computing infrastructure). |
| Software Dependencies | No | The paper mentions using the open-source sequence-to-sequence toolkit nematus, but it does not specify version numbers or any other software dependencies. |
| Experiment Setup | Yes | We used only one hidden layer with 256 units, set the word embedding dimension as 512, and used dropout for regularization, following the configuration of Vinyals et al. (2015). Pre-training was not implemented. Instead, the word embedding matrix and other network parameters were initialized randomly. For decoding, we used a beam search strategy with a fixed beam size of 10. (See the configuration sketch below the table.) |
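
The Pseudocode row quotes the caption of the paper's Table 3 (the bottom-up linearization). As a rough illustration only, and not the authors' actual algorithm or notation, the sketch below linearizes a small constituency tree into a shift/reduce-style action sequence; the `Tree` class and the action strings are assumptions made for this example.

```python
# Hypothetical sketch of bottom-up linearization (not the authors' code).
# Leaves become "sh" actions; every non-terminal is emitted as a reduce-style
# action only after all of its children, i.e. bottom-up.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Tree:
    label: str
    children: List[Union["Tree", str]]  # a child is a subtree or a word


def linearize_bottom_up(tree: Union[Tree, str]) -> List[str]:
    actions: List[str] = []

    def visit(node: Union[Tree, str]) -> None:
        if isinstance(node, str):          # word: shift it onto the stack
            actions.append("sh")
            return
        for child in node.children:        # linearize children first
            visit(child)
        actions.append(node.label + "=>")  # then reduce them into one node

    visit(tree)
    return actions


if __name__ == "__main__":
    # (S (NP the cat) (VP sleeps))
    tree = Tree("S", [Tree("NP", ["the", "cat"]), Tree("VP", ["sleeps"])])
    print(linearize_bottom_up(tree))
    # -> ['sh', 'sh', 'NP=>', 'sh', 'VP=>', 'S=>']
```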
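The Dataset Splits and Experiment Setup rows can be condensed into a small configuration sketch. It only restates the settings quoted above; the key names are illustrative assumptions, not actual nematus options.

```python
# Hypothetical configuration sketch restating only what the paper reports;
# key names are illustrative, not real nematus configuration options.

WSJ_SPLITS = {                     # Penn Treebank, WSJ portion
    "train": list(range(2, 22)),   # Sections 02-21
    "dev":   [22],                 # Section 22
    "test":  [23],                 # Section 23
}

MODEL_CONFIG = {
    "hidden_layers": 1,            # one hidden layer ...
    "hidden_units": 256,           # ... with 256 units
    "word_embedding_dim": 512,     # embedding dimension 512
    "dropout": True,               # dropout for regularization (rate not given)
    "pretraining": None,           # no pre-training; random initialization
}

DECODING_CONFIG = {
    "strategy": "beam_search",
    "beam_size": 10,               # fixed beam size of 10
}
```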