Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Transition-Based Neural Word Segmentation Using Word-Level Features
Authors: Meishan Zhang, Yue Zhang, Guohong Fu
JAIR 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on several benchmark datasets to thoroughly examine the effectiveness of neural word features. Results show the effectiveness of word- and subword-level features for neural Chinese word segmentation. With pretrained character and word embeddings, our method achieves state-of-the-art results. In addition, a combination of our neural features and the traditional discrete features results in further improved performance. We conduct a number of experimental analyses for a deeper understanding of our proposed neural model. |
| Researcher Affiliation | Academia | Meishan Zhang EMAIL School of Computer Science and Technology, Heilongjiang University, Harbin, China; Yue Zhang EMAIL Westlake University, Hangzhou, China; Guohong Fu EMAIL School of Computer Science and Technology, Heilongjiang University, Harbin, China. |
| Pseudocode | Yes | Algorithm 1 (beam-search decoding, where Θ is the set of all model parameters): function Decode(c1…cn, Θ): agenda ← {(φ (empty stack), c1…cn (queue), score = 0.0)}; for k in 1…n: list ← {}; for each candidate in agenda: new ← Apply(SEP, candidate, ck, Θ); AddItem(list, new); new ← Apply(APP, candidate, ck, Θ); AddItem(list, new); agenda ← Top-B(list, B); best ← BestItem(agenda); w1…wm ← ExtractWords(best) |
| Open Source Code | Yes | We make our codes and models publicly available under GPL at https://github.com/zhangmeishan/NNTranSegmentor. |
| Open Datasets | Yes | We use three benchmark datasets for evaluation, namely CTB6, PKU and MSR. The CTB6 corpus is taken from the Penn Chinese Treebank 6.0, and the PKU and MSR corpora can be obtained from Bake Off2005 (Emerson, 2005). [...] The Chinese Gigaword corpus (LDC2011T13) is used to pretrain character and word embeddings. |
| Dataset Splits | Yes | We follow Zhang et al. (2014a), splitting the CTB6 corpus into training, development and testing sections. For the PKU and MSR corpora, only the training and test datasets are specified and we randomly split 10% of the training sections for development. [...] Table 3 shows the overall statistics of the four datasets. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running experiments, such as specific CPU or GPU models. |
| Software Dependencies | No | The paper mentions using the 'word2vec tool' (Mikolov et al., 2013) and 'extended word2vec tool' (Levy and Goldberg, 2014) but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The hyper-parameter values are tuned according to preliminary results on the development corpus. We set the dimension size of the basic input character embeddings and word embeddings to 50. The dimension sizes of all the hidden layers of the neural model are set to 100. [...] The initial learning rate for Adagrad is set to 0.01, the regularization term in the training objective is set to 10^-8, and the value of η in max-margin training is set to 0.2. [...] We train different models on the corresponding training datasets for 20 iterations, and select the best iteration model according to their development performances. |
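The beam-search decoding quoted in the Pseudocode row above can be sketched as runnable Python. This is a minimal illustration, not the authors' implementation: the state is simplified to a tuple of segmented words plus a running score, and `score_action` is a hypothetical stand-in for the paper's neural scorer of SEP/APP transitions.

```python
def decode(chars, score_action, beam_size=16):
    """Segment `chars` into words via beam search over SEP/APP actions.

    `score_action(action, words_so_far, char)` is a placeholder for the
    neural scoring function; higher scores are better.
    """
    # Each agenda item: (words, score); words is a tuple of words built so far.
    agenda = [((), 0.0)]
    for c in chars:
        candidates = []
        for words, score in agenda:
            # SEP: start a new word with character c.
            candidates.append(
                (words + (c,), score + score_action("SEP", words, c)))
            # APP: append c to the last word (valid only if one exists).
            if words:
                candidates.append(
                    (words[:-1] + (words[-1] + c,),
                     score + score_action("APP", words, c)))
        # Top-B: keep the B highest-scoring items for the next step.
        agenda = sorted(candidates, key=lambda it: it[1],
                        reverse=True)[:beam_size]
    best_words, _ = max(agenda, key=lambda it: it[1])
    return list(best_words)
```

For example, a toy scorer that always prefers SEP yields one word per character, while one that prefers APP merges everything into a single word; the real model instead scores each action from word-level neural features.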