Unsupervised Learning Helps Supervised Neural Word Segmentation
Authors: Xiaobin Wang, Deng Cai, Linlin Li, Guangwei Xu, Hai Zhao, Luo Si (pp. 7200-7207)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on standard data sets show that the explored strategies indeed improve the recall rate of out-of-vocabulary words and thus boost the segmentation accuracy. Moreover, the model enhanced by the proposed methods outperforms state-of-the-art models in closed test and shows promising improvement trend when adopting three different strategies with the help of a large unlabeled data set. Our thorough empirical study eventually verifies the proposed approach outperforms the widely-used pre-training approach in terms of effectively making use of freely abundant unlabeled data. |
| Researcher Affiliation | Collaboration | Xiaobin Wang,¹ Deng Cai,² Linlin Li,¹ Guangwei Xu,¹ Hai Zhao,³ Luo Si¹ (¹Alibaba Group, ²The Chinese University of Hong Kong, ³Shanghai Jiao Tong University) |
| Pseudocode | Yes | Algorithm 1: multi-task learning with unlabeled data (a hedged sketch of such a loop appears below this table). |
| Open Source Code | No | The baseline model implementation is cloned from GitHub (https://github.com/jcyk/greedyCWS). ... We used an open-source version of an NPYLM-based segmenter (https://github.com/musyoku/python-npylm) as the unsupervised segmenter, which generates segmented texts for the label embedding and multi-task learning approaches. The paper links to the third-party code it builds on, but not to its own implementation of the proposed methods. |
| Open Datasets | Yes | We evaluate the effectiveness of our methods by F1-score on the widely used benchmark datasets, i.e., PKU, MSR, AS and CITYU, from the 2nd International CWS Bakeoff (Bakeoff-2005) (Emerson 2005). (A span-level F1 sketch appears below this table.) |
| Dataset Splits | Yes | Table 3: Dataset statistics, sentences (#s) / words (#w). Train: MSR 78k/2,122k, PKU 17k/1,010k, AS 638k/4,904k, CITYU 48k/1,310k. Dev: MSR 8.7k/246k, PKU 1.9k/100k, AS 71k/545k, CITYU 5.3k/146k. Test: MSR 4.0k/106k, PKU 1.9k/104k, AS 14k/123k, CITYU 1.4k/41k. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instance specifications). |
| Software Dependencies | No | The paper mentions using a baseline segmenter and an open-source NPYLM segmenter but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | Table 4: Hyper-parameters of the baseline model: character embedding size 100; word embedding size 50; hidden unit number 50; margin loss discount 0.2; maximum word length 6; decoding beam size 1. (Restated as a config sketch below this table.) |
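
The pseudocode row above cites Algorithm 1 (multi-task learning with unlabeled data). Since the authors' own implementation is not released, here is a minimal, hedged sketch of what such a loop typically looks like: a supervised segmentation loss on gold-labeled batches plus a down-weighted auxiliary loss on unlabeled batches carrying silver segmentations from the NPYLM segmenter. The `model.segmentation_loss` method, the batch iterators, and the `aux_weight` value are all assumptions, not the paper's actual interface.

```python
def train_multitask_epoch(model, optimizer, labeled_batches,
                          unlabeled_batches, aux_weight=0.5):
    """One epoch of multi-task training (hypothetical interface).

    labeled_batches:   batches with gold segmentations
    unlabeled_batches: batches with silver segmentations produced by an
                       unsupervised segmenter such as NPYLM
    """
    for gold_batch, silver_batch in zip(labeled_batches, unlabeled_batches):
        optimizer.zero_grad()
        # Main task: supervised loss against gold segmentations.
        loss = model.segmentation_loss(gold_batch)
        # Auxiliary task: the same loss against noisy silver
        # segmentations, down-weighted by aux_weight.
        loss = loss + aux_weight * model.segmentation_loss(silver_batch)
        loss.backward()   # PyTorch-style backward pass (assumed)
        optimizer.step()
```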
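The F1-score cited in the datasets row is computed over word spans, not characters: a predicted word counts as correct only if both of its boundaries match the gold segmentation. A small self-contained sketch of this metric (not the official Bakeoff scoring script):

```python
def word_spans(words):
    """Map a segmented sentence (list of words) to (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def segmentation_f1(gold_sents, pred_sents):
    """Micro-averaged span-level F1 over a corpus of segmented sentences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = word_spans(gold), word_spans(pred)
        correct += len(g & p)   # words with both boundaries correct
        n_gold += len(g)
        n_pred += len(p)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: splitting 来到 into 来/到 loses one gold span.
print(segmentation_f1([["他", "来到", "北京"]],
                      [["他", "来", "到", "北京"]]))  # ≈ 0.571
```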
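For convenience, the Table 4 hyper-parameters restated as a config object; the values are the paper's, but the field names are our own:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BaselineConfig:
    # Values from Table 4 of the paper; field names are assumptions.
    char_embedding_size: int = 100
    word_embedding_size: int = 50
    hidden_units: int = 50
    margin_loss_discount: float = 0.2
    max_word_length: int = 6
    beam_size: int = 1  # beam size 1, i.e. greedy decoding

config = BaselineConfig()
```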