Deep Semantic Role Labeling With Self-Attention
Authors: Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, Xiaodong Shi
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our single model achieves F1 = 83.4 on the CoNLL-2005 shared task dataset and F1 = 82.7 on the CoNLL-2012 shared task dataset, which outperforms the previous state-of-the-art results by 1.8 and 1.0 F1 score respectively. |
| Researcher Affiliation | Collaboration | Zhixing Tan (1), Mingxuan Wang (2), Jun Xie (2), Yidong Chen (1), Xiaodong Shi (1); (1) School of Information Science and Engineering, Xiamen University, Xiamen, China; (2) Mobile Internet Group, Tencent Technology Co., Ltd, Beijing, China; playinf@stu.xmu.edu.cn, {xuanswang, stiffxie}@tencent.com, {ydchen, mandel}@xmu.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/XMUNLP/Tagger |
| Open Datasets | Yes | We report our empirical studies of DEEPATT on the two commonly used datasets from the CoNLL-2005 shared task and the CoNLL-2012 shared task. The CoNLL-2005 dataset takes sections 2-21 of the Wall Street Journal (WSJ) corpus as training set, and section 24 as development set. The test set consists of section 23 of the WSJ corpus as well as 3 sections from the Brown corpus (Carreras and Màrquez 2005). The CoNLL-2012 dataset is extracted from the OntoNotes v5.0 corpus. The description and separation of training, development and test set can be found in Pradhan et al. (2013). |
| Dataset Splits | Yes | The CoNLL-2005 dataset takes sections 2-21 of the Wall Street Journal (WSJ) corpus as training set, and section 24 as development set. The test set consists of section 23 of the WSJ corpus as well as 3 sections from the Brown corpus (Carreras and Màrquez 2005). |
| Hardware Specification | Yes | The parsing speed is 50K tokens per second on a single Titan X GPU. |
| Software Dependencies | No | The paper mentions using Adadelta as an optimizer and GloVe for word embeddings but does not provide specific version numbers for software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., TensorFlow, PyTorch), or other libraries. |
| Experiment Setup | Yes | The dimension of word embeddings and predicate mask embeddings is set to 100 and the number of hidden layers is set to 10. We set the number of hidden units d to 200. The number of heads h is set to 8. We apply dropout (Srivastava et al. 2014) to prevent the networks from over-fitting. Dropout layers are added before residual connections with a keep probability of 0.8. Dropout is also applied before the attention softmax layer and the feed-forward ReLU hidden layer, and the keep probabilities are set to 0.9. We also employ label smoothing technique (Szegedy et al. 2016) with a smoothing value of 0.1 during training. Parameter optimization is performed using stochastic gradient descent. We adopt Adadelta (Zeiler 2012) (ϵ = 10−6 and ρ = 0.95) as the optimizer. To avoid the exploding gradients problem, we clip the norm of gradients with a predefined threshold 1.0 (Pascanu et al. 2013). Each SGD mini-batch contains approximately 4096 tokens for the CoNLL-2005 dataset and 8192 tokens for the CoNLL-2012 dataset. The learning rate is initialized to 1.0. After training 400K steps, we halve the learning rate every 100K steps. We train all models for 600K steps. |
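
The reported experiment setup can be collected into a short configuration sketch. The snippet below is our own illustrative summary, not code from the released Tagger repository: all names (`config`, `learning_rate`) are hypothetical, and the schedule encodes one plausible reading of "after training 400K steps, we halve the learning rate every 100K steps".

```python
# Hypothetical summary of the hyperparameters quoted above (names are ours,
# not taken from https://github.com/XMUNLP/Tagger).

config = {
    "word_embedding_dim": 100,       # word embeddings
    "mask_embedding_dim": 100,       # predicate mask embeddings
    "num_hidden_layers": 10,
    "hidden_units": 200,             # d
    "num_heads": 8,                  # h
    "residual_keep_prob": 0.8,       # dropout before residual connections
    "attention_keep_prob": 0.9,      # dropout before attention softmax
    "ffn_relu_keep_prob": 0.9,       # dropout before feed-forward ReLU hidden layer
    "label_smoothing": 0.1,
    "optimizer": "Adadelta",
    "adadelta_epsilon": 1e-6,
    "adadelta_rho": 0.95,
    "gradient_clip_norm": 1.0,
    "batch_tokens_conll2005": 4096,
    "batch_tokens_conll2012": 8192,
    "initial_learning_rate": 1.0,
    "total_steps": 600_000,
}

def learning_rate(step: int, base_lr: float = 1.0) -> float:
    """Keep the learning rate constant for the first 400K steps, then halve it
    every further 100K steps (our interpretation of the paper's schedule)."""
    if step < 400_000:
        return base_lr
    halvings = 1 + (step - 400_000) // 100_000
    return base_lr * (0.5 ** halvings)

# Example: learning_rate(0) == 1.0, learning_rate(400_000) == 0.5,
# learning_rate(500_000) == 0.25.
```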