Word Segmentation for Chinese Novels
Authors: Likun Qiu, Yue Zhang
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results on five different novels show significantly improved accuracies, in particular for OOV words. We perform experiments on segmenting five different novels. |
| Researcher Affiliation | Academia | Likun Qiu and Yue Zhang, Singapore University of Technology and Design, 20 Dover Drive, Singapore 138682. qiulikun@gmail.com, yue_zhang@sutd.edu.sg |
| Pseudocode | Yes | Algorithm 1 shows pseudocode of the double propagation algorithm. |
| Open Source Code | No | To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html. This release covers the annotated datasets only; the paper does not state that source code for the described method is released. |
| Open Datasets | Yes | The People's Daily Corpus, which contains all the articles of People's Daily in January 1998, is used as the source-domain annotated corpus Gen Corpus. To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html. |
| Dataset Splits | Yes | We take 300 annotated sentences from the novel ZX as the development data (ZX-dev), which is used to determine the amount of sentences for a self-training baseline, and for the parameter tuning of the double propagation algorithm, the noun entity classifier, and the POS classifier. |
| Hardware Specification | No | No specific hardware details (such as GPU or CPU models, or memory specifications) are mentioned for the experimental setup. |
| Software Dependencies | Yes | We apply the joint segmentor and POS-tagger of Zhang and Clark (2010) as our baseline system. From http://sourceforge.net/zpar, version 0.6, using the agenda implementation of chinese.postagger. |
| Experiment Setup | Yes | Beam-search is applied to find a highest-scored sequence of transitions heuristically. The system scores search candidates using a linear model, which is trained using the averaged perceptron (Collins 2002) and early-update (Collins and Roark 2004). We use the ZX development data to decide the best number of target-domain sentences from 4000 to 20000, and the best result is achieved using 8000 sentences. Words with probabilities above a threshold α are taken as novel-specific nouns; a second threshold β is applied analogously. |
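The quoted setup combines beam-search decoding with averaged-perceptron training and early-update. As a rough illustration of how those three pieces fit together — not the paper's implementation; the 'S'/'A' action set, the feature templates, and the toy data below are invented placeholders — a minimal sketch:

```python
from collections import defaultdict

# Toy transition system: one action per character, 'S' (start a new word)
# or 'A' (append to the previous word). These feature templates are
# illustrative placeholders, not the segmentor's actual features.
def feats(action, chars, i):
    prev = chars[i - 1] if i > 0 else "<s>"
    return [("cur", action, chars[i]), ("prev", action, prev)]

def beam_search(chars, weights, beam_size=4, gold=None):
    """Keep the top-k action sequences at each position. When a gold
    sequence is supplied (training), stop as soon as the gold prefix
    falls off the beam (early-update) and return the current best
    candidate together with that gold prefix."""
    beam = [((), 0.0)]
    for i in range(len(chars)):
        cands = []
        for seq, score in beam:
            for a in ("S", "A"):
                s = score + sum(weights.get(f, 0.0) for f in feats(a, chars, i))
                cands.append((seq + (a,), s))
        cands.sort(key=lambda x: -x[1])
        beam = cands[:beam_size]
        if gold is not None:
            gpre = tuple(gold[: i + 1])
            if all(seq != gpre for seq, _ in beam):
                return beam[0][0], gpre          # early-update point
    return beam[0][0], (tuple(gold) if gold is not None else None)

def train(data, epochs=5, beam_size=4):
    """Averaged perceptron: w holds raw weights, acc holds timestamped
    updates, so the averaged weights are recovered as w - acc / t."""
    w, acc, t = defaultdict(float), defaultdict(float), 1
    for _ in range(epochs):
        for chars, gold in data:
            pred, gpre = beam_search(chars, w, beam_size, gold)
            if pred != gpre:
                # reward gold features, penalize predicted features
                for i, (pa, ga) in enumerate(zip(pred, gpre)):
                    for f in feats(ga, chars, i):
                        w[f] += 1.0; acc[f] += t
                    for f in feats(pa, chars, i):
                        w[f] -= 1.0; acc[f] -= t
            t += 1
    return {f: w[f] - acc[f] / t for f in w}
```

After training, decoding with `gold=None` returns the highest-scoring action sequence under the averaged weights; early-update matters because updating on a full sentence whose gold analysis already left the beam would credit search states the decoder can never reach.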