Deep Semantic Role Labeling With Self-Attention
Authors: Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, Xiaodong Shi
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our single model achieves F1 = 83.4 on the CoNLL-2005 shared task dataset and F1 = 82.7 on the CoNLL-2012 shared task dataset, which outperforms the previous state-of-the-art results by 1.8 and 1.0 F1 score respectively. |
| Researcher Affiliation | Collaboration | Zhixing Tan (1), Mingxuan Wang (2), Jun Xie (2), Yidong Chen (1), Xiaodong Shi (1); (1) School of Information Science and Engineering, Xiamen University, Xiamen, China; (2) Mobile Internet Group, Tencent Technology Co., Ltd, Beijing, China; playinf@stu.xmu.edu.cn, {xuanswang, stiffxie}@tencent.com, {ydchen, mandel}@xmu.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/XMUNLP/Tagger |
| Open Datasets | Yes | We report our empirical studies of DEEPATT on the two commonly used datasets from the CoNLL-2005 shared task and the CoNLL-2012 shared task. The CoNLL-2005 dataset takes sections 2-21 of the Wall Street Journal (WSJ) corpus as training set, and section 24 as development set. The test set consists of section 23 of the WSJ corpus as well as 3 sections from the Brown corpus (Carreras and Màrquez 2005). The CoNLL-2012 dataset is extracted from the OntoNotes v5.0 corpus. The description and separation of training, development and test set can be found in Pradhan et al. (2013). |
| Dataset Splits | Yes | The CoNLL-2005 dataset takes sections 2-21 of the Wall Street Journal (WSJ) corpus as training set, and section 24 as development set. The test set consists of section 23 of the WSJ corpus as well as 3 sections from the Brown corpus (Carreras and Màrquez 2005). |
| Hardware Specification | Yes | The parsing speed is 50K tokens per second on a single Titan X GPU. |
| Software Dependencies | No | The paper mentions using Adadelta as an optimizer and GloVe for word embeddings but does not provide specific version numbers for software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries. |
| Experiment Setup | Yes | The dimension of word embeddings and predicate mask embeddings is set to 100 and the number of hidden layers is set to 10. We set the number of hidden units d to 200. The number of heads h is set to 8. We apply dropout (Srivastava et al. 2014) to prevent the networks from over-fitting. Dropout layers are added before residual connections with a keep probability of 0.8. Dropout is also applied before the attention softmax layer and the feed-forward ReLU hidden layer, and the keep probabilities are set to 0.9. We also employ label smoothing technique (Szegedy et al. 2016) with a smoothing value of 0.1 during training. Parameter optimization is performed using stochastic gradient descent. We adopt Adadelta (Zeiler 2012) (ϵ = 10−6 and ρ = 0.95) as the optimizer. To avoid the exploding gradients problem, we clip the norm of gradients with a predefined threshold 1.0 (Pascanu et al. 2013). Each SGD mini-batch contains approximately 4096 tokens for the CoNLL-2005 dataset and 8192 tokens for the CoNLL-2012 dataset. The learning rate is initialized to 1.0. After training 400K steps, we halve the learning rate every 100K steps. We train all models for 600K steps. |
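
The reported experiment setup can be collected into a short configuration sketch. The snippet below is our own illustrative summary, not code from the released Tagger repository: all names (`config`, `learning_rate`) are hypothetical, and the schedule encodes one plausible reading of "after training 400K steps, we halve the learning rate every 100K steps".

```python
# Hypothetical summary of the hyperparameters quoted above (names are ours,
# not taken from https://github.com/XMUNLP/Tagger).

config = {
    "word_embedding_dim": 100,       # word embeddings
    "mask_embedding_dim": 100,       # predicate mask embeddings
    "num_hidden_layers": 10,
    "hidden_units": 200,             # d
    "num_heads": 8,                  # h
    "residual_keep_prob": 0.8,       # dropout before residual connections
    "attention_keep_prob": 0.9,      # dropout before attention softmax
    "ffn_relu_keep_prob": 0.9,       # dropout before feed-forward ReLU hidden layer
    "label_smoothing": 0.1,
    "optimizer": "Adadelta",
    "adadelta_epsilon": 1e-6,
    "adadelta_rho": 0.95,
    "gradient_clip_norm": 1.0,
    "batch_tokens_conll2005": 4096,
    "batch_tokens_conll2012": 8192,
    "initial_learning_rate": 1.0,
    "total_steps": 600_000,
}

def learning_rate(step: int, base_lr: float = 1.0) -> float:
    """Keep the learning rate constant for the first 400K steps, then halve it
    every further 100K steps (our interpretation of the paper's schedule)."""
    if step < 400_000:
        return base_lr
    halvings = 1 + (step - 400_000) // 100_000
    return base_lr * (0.5 ** halvings)

# Example: learning_rate(0) == 1.0, learning_rate(400_000) == 0.5,
# learning_rate(500_000) == 0.25.
```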