reproducibilityindex.ai

Unsupervised Learning Helps Supervised Neural Word Segmentation

Authors: Xiaobin Wang, Deng Cai, Linlin Li, Guangwei Xu, Hai Zhao, Luo Si (pp. 7200-7207)

AAAI 2019

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on standard data sets show that the explored strategies indeed improve the recall rate of out-of-vocabulary words and thus boost the segmentation accuracy. Moreover, the model enhanced by the proposed methods outperforms state-of-the-art models in closed test and shows promising improvement trend when adopting three different strategies with the help of a large unlabeled data set. Our thorough empirical study eventually verifies the proposed approach outperforms the widely used pre-training approach in terms of effectively making use of freely abundant unlabeled data.
Researcher Affiliation	Collaboration	Xiaobin Wang (1), Deng Cai (2), Linlin Li (1), Guangwei Xu (1), Hai Zhao (3), Luo Si (1). Affiliations: (1) Alibaba Group; (2) The Chinese University of Hong Kong; (3) Shanghai Jiao Tong University.
Pseudocode	Yes	Algorithm 1 multi-task learning with unlabeled data
Open Source Code	No	The baseline model implementation is cloned from Github for the baseline segmenter (footnote 4: https://github.com/jcyk/greedyCWS). ... We used an open-source version of NPYLM based segmenter (footnote 5: https://github.com/musyoku/python-npylm) as the unsupervised segmenter, which generates segmented texts for the label embedding and multi-task learning approaches. The paper provides links to the third-party code it uses, but not its own implementation of the proposed methods.
Open Datasets	Yes	We evaluate the effectiveness of our methods by F1-score on the widely used benchmark datasets, i.e., PKU, MSR, AS and CITYU, from the 2nd international CWS Bakeoff (Bakeoff-2005) (Emerson 2005).
Dataset Splits	Yes	Table 3: Statistics of the dataset, number of sentences (#s) and words (#w). Train #s: MSR 78k, PKU 17k, AS 638k, CITYU 48k. Train #w: MSR 2,122k, PKU 1,010k, AS 4,904k, CITYU 1,310k. Dev #s: MSR 8.7k, PKU 1.9k, AS 71k, CITYU 5.3k. Dev #w: MSR 246k, PKU 100k, AS 545k, CITYU 146k. Test #s: MSR 4.0k, PKU 1.9k, AS 14k, CITYU 1.4k. Test #w: MSR 106k, PKU 104k, AS 123k, CITYU 41k.
Hardware Specification	No	The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies	No	The paper mentions using a baseline segmenter and an open-source NPYLM segmenter but does not provide specific version numbers for any software dependencies.
Experiment Setup	Yes	Table 4: Hyper-parameters of the baseline model. Character embedding size: 100; Word embedding size: 50; Hidden unit number: 50; Margin loss discount: 0.2; Maximum word length: 6; Decoding beam size: 1.
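
The entries above cite F1-score and the recall rate of out-of-vocabulary (OOV) words as the evaluation measures. As a rough illustration of how such word-level scores are commonly computed for segmentation, here is a minimal Python sketch. It is not the paper's code or the official Bakeoff scoring script; the function names (`to_spans`, `segmentation_prf`, `oov_recall`) and the toy sentences are hypothetical, and it assumes each sentence is given as a list of words whose concatenation is identical in the gold and predicted segmentations.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]


def to_spans(words: List[str]) -> List[Span]:
    """Convert a segmented sentence into (start, end) character spans."""
    spans = []
    start = 0
    for word in words:
        end = start + len(word)
        spans.append((start, end))
        start = end
    return spans


def segmentation_prf(gold_sents: Iterable[List[str]],
                     pred_sents: Iterable[List[str]]) -> Tuple[float, float, float]:
    """Word-level precision, recall and F1: a predicted word counts as
    correct only if both of its boundaries match a gold word."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in characters")
        gold_spans = set(to_spans(gold))
        pred_spans = set(to_spans(pred))
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def oov_recall(gold_sents: Iterable[List[str]],
               pred_sents: Iterable[List[str]],
               train_vocab: Set[str]) -> float:
    """Fraction of gold words absent from the training vocabulary
    that the prediction recovers with exact boundaries."""
    oov_total = oov_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        pred_spans = set(to_spans(pred))
        for word, span in zip(gold, to_spans(gold)):
            if word not in train_vocab:
                oov_total += 1
                if span in pred_spans:
                    oov_correct += 1
    return oov_correct / oov_total if oov_total else 0.0


# Toy example (hypothetical data, not taken from the benchmark corpora).
gold = [["我", "喜欢", "北京"]]
pred = [["我", "喜", "欢", "北京"]]
precision, recall, f1 = segmentation_prf(gold, pred)
print(precision, recall, f1)
print(oov_recall(gold, pred, train_vocab={"我", "喜欢"}))
```

In the toy example the prediction splits one gold word in two, so two of its four predicted words and two of the three gold words match. Matching on character spans rather than word strings keeps repeated words in a sentence from being confused with each other.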