Unsupervised Learning Helps Supervised Neural Word Segmentation

Authors: Xiaobin Wang, Deng Cai, Linlin Li, Guangwei Xu, Hai Zhao, Luo Si

AAAI 2019, pp. 7200-7207 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on standard data sets show that the explored strategies indeed improve the recall rate of out-of-vocabulary words and thus boost the segmentation accuracy. Moreover, the model enhanced by the proposed methods outperforms state-of-the-art models in closed test and shows promising improvement trend when adopting three different strategies with the help of a large unlabeled data set. Our thorough empirical study eventually verifies the proposed approach outperforms the widely-used pre-training approach in terms of effectively making use of freely abundant unlabeled data."
Researcher Affiliation | Collaboration | Xiaobin Wang (1), Deng Cai (2), Linlin Li (1), Guangwei Xu (1), Hai Zhao (3), Luo Si (1); affiliations: (1) Alibaba Group, (2) The Chinese University of Hong Kong, (3) Shanghai Jiao Tong University
Pseudocode | Yes | "Algorithm 1: multi-task learning with unlabeled data" (a hedged training-step sketch follows the table)
Open Source Code | No | "The baseline model implementation is cloned from GitHub for the baseline segmenter: https://github.com/jcyk/greedyCWS. ... We used an open-source version of an NPYLM-based segmenter as the unsupervised segmenter, which generates segmented texts for the label embedding and multi-task learning approaches: https://github.com/musyoku/python-npylm." The paper provides links to the third-party code it builds on, but not to its own implementation of the proposed methods.
Open Datasets | Yes | "We evaluate the effectiveness of our methods by F1-score on the widely used benchmark datasets, i.e., PKU, MSR, AS and CITYU, from the 2nd international CWS Bakeoff (Bakeoff-2005) (Emerson 2005)." (a word-level F1 sketch follows the table)
Dataset Splits | Yes | Table 3 gives per-split statistics: number of sentences (#s) and words (#w). (A counting helper follows the table.)

          MSR      PKU      AS       CITYU
Train #s  78k      17k      638k     48k
Train #w  2,122k   1,010k   4,904k   1,310k
Dev   #s  8.7k     1.9k     71k      5.3k
Dev   #w  246k     100k     545k     146k
Test  #s  4.0k     1.9k     14k      1.4k
Test  #w  106k     104k     123k     41k
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies | No | The paper mentions using a baseline segmenter and an open-source NPYLM segmenter but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | Table 4 lists the hyper-parameters of the baseline model (a config-dict transcription follows the table):

Character embedding size: 100
Word embedding size:      50
Hidden unit number:       50
Margin loss discount:     0.2
Maximum word length:      6
Decoding beam size:       1
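
The paper's Algorithm 1 is not reproduced in this report, but the core multi-task idea, a shared encoder trained jointly on gold segmentations and on pseudo-labels produced by the unsupervised NPYLM segmenter, can be sketched as follows. This is a minimal PyTorch sketch using a BiLSTM with character tagging for brevity; the authors' actual model is a word-based greedy segmenter with a margin loss (see Table 4), so the class names, the two-head design, and the aux_weight parameter below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MultiTaskSegmenter(nn.Module):
    """Shared BiLSTM encoder with two tagging heads: one supervised on
    gold segmentations, one on NPYLM pseudo-labels (assumed design)."""
    def __init__(self, vocab_size, emb_size=100, hidden=50, n_tags=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden, batch_first=True,
                               bidirectional=True)
        self.gold_head = nn.Linear(2 * hidden, n_tags)    # labeled data
        self.pseudo_head = nn.Linear(2 * hidden, n_tags)  # unlabeled data

    def forward(self, chars, use_pseudo_head=False):
        h, _ = self.encoder(self.embed(chars))
        head = self.pseudo_head if use_pseudo_head else self.gold_head
        return head(h)  # (batch, seq_len, n_tags) tag scores

def train_step(model, optimizer, labeled_batch, unlabeled_batch,
               aux_weight=0.5):
    """One joint step: supervised loss on gold tags plus a down-weighted
    auxiliary loss on tags derived from NPYLM's unsupervised output.
    The weighting scheme is an assumption, not from the paper."""
    criterion = nn.CrossEntropyLoss()
    chars, tags = labeled_batch          # LongTensors of shape (B, T)
    loss = criterion(model(chars).flatten(0, 1), tags.flatten())
    u_chars, u_tags = unlabeled_batch    # pseudo-labels from NPYLM
    loss = loss + aux_weight * criterion(
        model(u_chars, use_pseudo_head=True).flatten(0, 1),
        u_tags.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```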
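The Bakeoff-2005 benchmarks cited under Open Datasets are scored by word-level F1. A common way to reimplement this metric is to convert each segmentation into character-offset spans and count exact span matches; the helper below is such a sketch, not the official Bakeoff scoring script.

```python
def to_spans(words):
    """Convert a word sequence to a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def segmentation_f1(gold_sents, pred_sents):
    """Word-level precision, recall, and F1 over exact span matches."""
    tp = gold_total = pred_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = to_spans(gold), to_spans(pred)
        tp += len(g & p)
        gold_total += len(g)
        pred_total += len(p)
    precision = tp / pred_total
    recall = tp / gold_total
    return precision, recall, 2 * precision * recall / (precision + recall)

# Example: a prediction that over-splits one gold word.
gold = [["他", "来到", "北京"]]
pred = [["他", "来", "到", "北京"]]
print(segmentation_f1(gold, pred))  # (0.5, 0.666..., 0.571...)
```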
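The per-split counts in Table 3 can be re-derived from the Bakeoff-2005 distribution, which stores one sentence per line with words separated by whitespace. The helper and the example file path below are illustrative assumptions.

```python
def corpus_stats(path):
    """Count sentences and words in a whitespace-segmented corpus file."""
    n_sents = n_words = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            if words:  # skip blank lines
                n_sents += 1
                n_words += len(words)
    return n_sents, n_words

# e.g. corpus_stats("icwb2-data/training/msr_training.utf8")
```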
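Finally, the Table 4 hyper-parameters can be transcribed into a config dict for experiment scripts; the key names are illustrative and not taken from the paper or the baseline repository.

```python
# Baseline hyper-parameters from Table 4 (key names are assumptions).
BASELINE_CONFIG = {
    "char_embedding_size": 100,
    "word_embedding_size": 50,
    "hidden_units": 50,
    "margin_loss_discount": 0.2,
    "max_word_length": 6,
    "beam_size": 1,
}
```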