Word Segmentation for Chinese Novels

Authors: Likun Qiu, Yue Zhang

AAAI 2015

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response (quoted text is taken from the paper; unquoted remarks are part of the LLM's explanation).

Research Type: Experimental
LLM Response: "Results on five different novels show significantly improved accuracies, in particular for OOV words." "We perform experiments on segmenting five different novels."

Researcher Affiliation: Academia
LLM Response: "Likun Qiu and Yue Zhang, Singapore University of Technology and Design, 20 Dover Drive, Singapore 138682, qiulikun@gmail.com, yue_zhang@sutd.edu.sg"

Pseudocode: Yes
LLM Response: "Algorithm 1 shows pseudocode of the double propagation algorithm." (An illustrative double-propagation sketch is given after the entries below.)

Open Source Code: No
LLM Response: "To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html." This link covers the annotated datasets only; it is not a statement that the source code for the described methodology is released.

Open Datasets: Yes
LLM Response: "The People's Daily Corpus, which contains all the articles of People's Daily in January 1998, is used as the source-domain annotated corpus (Gen Corpus)." "To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html."

Dataset Splits: Yes
LLM Response: "We take 300 annotated sentences from [the novel] (ZX) as the development data (ZX-dev), which is used to determine the amount of sentences for a self-training baseline, and for the parameter tuning of the double propagation algorithm, the noun entity classifier, and the POS classifier." (An illustrative split snippet follows the entries below.)

Hardware Specification: No
LLM Response: No specific hardware details (such as GPU or CPU models, or memory specifications) are mentioned for the experimental setup.

Software Dependencies: Yes
LLM Response: "We apply the joint segmentor and POS-tagger of Zhang and Clark (2010) as our baseline system." (Footnote: from http://sourceforge.net/zpar, version 0.6, using the agenda implementation of chinese.postagger.)

Experiment Setup: Yes
LLM Response: "Beam-search is applied to find a highest-scored sequence of transitions heuristically. The system scores search candidates using a linear model, which is trained using the averaged perceptron (Collins 2002) and early-update (Collins and Roark 2004)." "We use the ZX development data to decide the best number of target-domain sentences from 4000 to 20000, and the best result is achieved using 8000 sentences." Further excerpts mention "taking words with probabilities above a threshold α as novel-specific nouns" and a "probability above a threshold β". (An illustrative training sketch follows below.)
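
On the Pseudocode entry: the paper's Algorithm 1 itself is not reproduced in this summary. As a rough, non-authoritative illustration of what a double-propagation loop does, the sketch below alternates between using context patterns to find new noun candidates and using the nouns to induce new patterns. The pattern representation (single characters of left and right context), the helper names, and the candidate length limit are assumptions made for illustration, not the paper's actual algorithm.

```python
import re

def induce_patterns(sentence, noun):
    """Illustrative pattern inducer: one character of left and one of right
    context around each occurrence of the noun (a stand-in for the paper's
    richer patterns)."""
    pats = set()
    for m in re.finditer(re.escape(noun), sentence):
        left = sentence[m.start() - 1:m.start()]
        right = sentence[m.end():m.end() + 1]
        if left and right:
            pats.add((left, right))
    return pats

def match_candidates(sentence, pattern, max_len=4):
    """Illustrative matcher: any span of up to max_len characters enclosed by
    the pattern's context characters is treated as a noun candidate."""
    left, right = pattern
    regex = re.escape(left) + "(.{1,%d}?)" % max_len + re.escape(right)
    return {m.group(1) for m in re.finditer(regex, sentence)}

def double_propagation(sentences, seed_nouns, max_iters=10):
    """Alternate between noun -> pattern and pattern -> noun expansion until
    nothing new is found (the core idea of double propagation)."""
    nouns, patterns = set(seed_nouns), set()
    for _ in range(max_iters):
        new_patterns = {p for s in sentences for n in nouns
                        for p in induce_patterns(s, n)} - patterns
        new_nouns = {c for s in sentences for p in patterns | new_patterns
                     for c in match_candidates(s, p)} - nouns
        if not new_patterns and not new_nouns:
            break  # converged: nothing new was propagated
        patterns |= new_patterns
        nouns |= new_nouns
    return nouns, patterns
```

In the paper, the propagated candidates are further filtered (for example by the noun entity and POS classifiers mentioned in the Dataset Splits entry); the sketch above omits any such filtering.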
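
On the Dataset Splits entry: the held-out development set is straightforward to mirror when reproducing the setup. The following minimal sketch assumes a hypothetical one-sentence-per-line file of annotated ZX data; whether the paper's 300 sentences were sampled randomly or taken in document order is not stated in the excerpt above.

```python
import random

# Hypothetical reconstruction of the ZX development split: hold out 300
# annotated sentences (ZX-dev) and keep the remainder for other uses.
# The file name and one-sentence-per-line format are assumptions.
with open("zx_annotated.txt", encoding="utf-8") as f:
    sentences = [line.rstrip("\n") for line in f if line.strip()]

random.seed(0)                      # fixed seed for a reproducible split
random.shuffle(sentences)
zx_dev, zx_rest = sentences[:300], sentences[300:]
print(len(zx_dev), len(zx_rest))
```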
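
On the Experiment Setup entry: the quoted training scheme (beam search scored by a linear model, trained with the averaged perceptron and early update) is a standard recipe, rendered generically below. The hooks initial, expand, features, and is_gold_prefix stand in for the joint segmentation and POS-tagging transition system and are assumptions, as is the naive weight averaging; this is a sketch, not the authors' implementation.

```python
from collections import defaultdict

def dot(weights, feats):
    """Linear model score: sparse dot product of features with weights."""
    return sum(weights[f] * v for f, v in feats.items())

def train_early_update(examples, initial, expand, features, is_gold_prefix,
                       beam_size=16, epochs=10):
    """Beam-search perceptron training with early update, in the spirit of
    Collins (2002) and Collins and Roark (2004). All task-specific behaviour
    lives in the caller-supplied hooks; this is a sketch, not the paper's code."""
    weights = defaultdict(float)
    totals = defaultdict(float)       # running sum of weights for averaging
    n_seen = 0

    for _ in range(epochs):
        for sentence, gold in examples:
            beam = [initial(sentence)]
            while True:
                # Expand every hypothesis in the beam and keep the top-k.
                candidates = [s for state in beam for s in expand(state)]
                if not candidates:        # all hypotheses are complete
                    break
                candidates.sort(key=lambda s: dot(weights, features(s)),
                                reverse=True)
                beam = candidates[:beam_size]
                gold_state = next((s for s in candidates
                                   if is_gold_prefix(s, gold)), None)
                if gold_state is None:
                    break                 # gold prefix not derivable; skip
                if gold_state not in beam:
                    # Early update: the gold prefix fell out of the beam, so
                    # update towards the gold prefix, away from the current
                    # best prefix, and stop decoding this sentence.
                    for f, v in features(gold_state).items():
                        weights[f] += v
                    for f, v in features(beam[0]).items():
                        weights[f] -= v
                    break
            # (A full update when decoding finishes with a wrong final
            # hypothesis is omitted here for brevity.)
            n_seen += 1
            for f, v in weights.items():
                totals[f] += v            # averaged-perceptron bookkeeping

    return {f: v / n_seen for f, v in totals.items()}
```

The same entry also mentions probability thresholds α and β for selecting novel-specific nouns and POS labels; in code this amounts to a simple filter such as [w for w, p in probs.items() if p > alpha], with the thresholds presumably tuned on ZX-dev as described in the Dataset Splits entry.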