Word Segmentation for Chinese Novels

Authors: Likun Qiu, Yue Zhang

AAAI 2015

Reproducibility assessment. Each entry below gives the reproducibility variable, the assessed result, and the supporting LLM response (quoted text is taken from the paper; unquoted remarks are part of the LLM's explanation).

Research Type: Experimental
LLM Response: "Results on five different novels show significantly improved accuracies, in particular for OOV words." "We perform experiments on segmenting five different novels."

Researcher Affiliation: Academia
LLM Response: "Likun Qiu and Yue Zhang, Singapore University of Technology and Design, 20 Dover Drive, Singapore 138682, qiulikun@gmail.com, yue_zhang@sutd.edu.sg"

Pseudocode: Yes
LLM Response: "Algorithm 1 shows pseudocode of the double propagation algorithm." (An illustrative double-propagation sketch is given after the entries below.)

Open Source Code: No
LLM Response: "To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html." This link covers the annotated datasets only; it is not a statement that the source code for the described methodology is released.

Open Datasets: Yes
LLM Response: "The People's Daily Corpus, which contains all the articles of People's Daily in January 1998, is used as the source-domain annotated corpus (Gen Corpus)." "To facilitate future comparisons, we release our annotated datasets at http://people.sutd.edu.sg/%7Eyue_zhang/publication.html."

Dataset Splits: Yes
LLM Response: "We take 300 annotated sentences from [the novel] (ZX) as the development data (ZX-dev), which is used to determine the amount of sentences for a self-training baseline, and for the parameter tuning of the double propagation algorithm, the noun entity classifier, and the POS classifier." (An illustrative split snippet follows the entries below.)

Hardware Specification: No
LLM Response: No specific hardware details (such as GPU or CPU models, or memory specifications) are mentioned for the experimental setup.

Software Dependencies: Yes
LLM Response: "We apply the joint segmentor and POS-tagger of Zhang and Clark (2010) as our baseline system." (Footnote: from http://sourceforge.net/zpar, version 0.6, using the agenda implementation of chinese.postagger.)

Experiment Setup: Yes
LLM Response: "Beam-search is applied to find a highest-scored sequence of transitions heuristically. The system scores search candidates using a linear model, which is trained using the averaged perceptron (Collins 2002) and early-update (Collins and Roark 2004)." "We use the ZX development data to decide the best number of target-domain sentences from 4000 to 20000, and the best result is achieved using 8000 sentences." Further excerpts mention "taking words with probabilities above a threshold α as novel-specific nouns" and a "probability above a threshold β". (An illustrative training sketch follows below.)
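
On the Pseudocode entry: the paper's Algorithm 1 itself is not reproduced in this summary. As a rough, non-authoritative illustration of what a double-propagation loop does, the sketch below alternates between using context patterns to find new noun candidates and using the nouns to induce new patterns. The pattern representation (single characters of left and right context), the helper names, and the candidate length limit are assumptions made for illustration, not the paper's actual algorithm.

```python
import re

def induce_patterns(sentence, noun):
    """Illustrative pattern inducer: one character of left and one of right
    context around each occurrence of the noun (a stand-in for the paper's
    richer patterns)."""
    pats = set()
    for m in re.finditer(re.escape(noun), sentence):
        left = sentence[m.start() - 1:m.start()]
        right = sentence[m.end():m.end() + 1]
        if left and right:
            pats.add((left, right))
    return pats

def match_candidates(sentence, pattern, max_len=4):
    """Illustrative matcher: any span of up to max_len characters enclosed by
    the pattern's context characters is treated as a noun candidate."""
    left, right = pattern
    regex = re.escape(left) + "(.{1,%d}?)" % max_len + re.escape(right)
    return {m.group(1) for m in re.finditer(regex, sentence)}

def double_propagation(sentences, seed_nouns, max_iters=10):
    """Alternate between noun -> pattern and pattern -> noun expansion until
    nothing new is found (the core idea of double propagation)."""
    nouns, patterns = set(seed_nouns), set()
    for _ in range(max_iters):
        new_patterns = {p for s in sentences for n in nouns
                        for p in induce_patterns(s, n)} - patterns
        new_nouns = {c for s in sentences for p in patterns | new_patterns
                     for c in match_candidates(s, p)} - nouns
        if not new_patterns and not new_nouns:
            break  # converged: nothing new was propagated
        patterns |= new_patterns
        nouns |= new_nouns
    return nouns, patterns
```

In the paper, the propagated candidates are further filtered (for example by the noun entity and POS classifiers mentioned in the Dataset Splits entry); the sketch above omits any such filtering.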
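
On the Dataset Splits entry: the held-out development set is straightforward to mirror when reproducing the setup. The following minimal sketch assumes a hypothetical one-sentence-per-line file of annotated ZX data; whether the paper's 300 sentences were sampled randomly or taken in document order is not stated in the excerpt above.

```python
import random

# Hypothetical reconstruction of the ZX development split: hold out 300
# annotated sentences (ZX-dev) and keep the remainder for other uses.
# The file name and one-sentence-per-line format are assumptions.
with open("zx_annotated.txt", encoding="utf-8") as f:
    sentences = [line.rstrip("\n") for line in f if line.strip()]

random.seed(0)                      # fixed seed for a reproducible split
random.shuffle(sentences)
zx_dev, zx_rest = sentences[:300], sentences[300:]
print(len(zx_dev), len(zx_rest))
```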
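
On the Experiment Setup entry: the quoted training scheme (beam search scored by a linear model, trained with the averaged perceptron and early update) is a standard recipe, rendered generically below. The hooks initial, expand, features, and is_gold_prefix stand in for the joint segmentation and POS-tagging transition system and are assumptions, as is the naive weight averaging; this is a sketch, not the authors' implementation.

```python
from collections import defaultdict

def dot(weights, feats):
    """Linear model score: sparse dot product of features with weights."""
    return sum(weights[f] * v for f, v in feats.items())

def train_early_update(examples, initial, expand, features, is_gold_prefix,
                       beam_size=16, epochs=10):
    """Beam-search perceptron training with early update, in the spirit of
    Collins (2002) and Collins and Roark (2004). All task-specific behaviour
    lives in the caller-supplied hooks; this is a sketch, not the paper's code."""
    weights = defaultdict(float)
    totals = defaultdict(float)       # running sum of weights for averaging
    n_seen = 0

    for _ in range(epochs):
        for sentence, gold in examples:
            beam = [initial(sentence)]
            while True:
                # Expand every hypothesis in the beam and keep the top-k.
                candidates = [s for state in beam for s in expand(state)]
                if not candidates:        # all hypotheses are complete
                    break
                candidates.sort(key=lambda s: dot(weights, features(s)),
                                reverse=True)
                beam = candidates[:beam_size]
                gold_state = next((s for s in candidates
                                   if is_gold_prefix(s, gold)), None)
                if gold_state is None:
                    break                 # gold prefix not derivable; skip
                if gold_state not in beam:
                    # Early update: the gold prefix fell out of the beam, so
                    # update towards the gold prefix, away from the current
                    # best prefix, and stop decoding this sentence.
                    for f, v in features(gold_state).items():
                        weights[f] += v
                    for f, v in features(beam[0]).items():
                        weights[f] -= v
                    break
            # (A full update when decoding finishes with a wrong final
            # hypothesis is omitted here for brevity.)
            n_seen += 1
            for f, v in weights.items():
                totals[f] += v            # averaged-perceptron bookkeeping

    return {f: v / n_seen for f, v in totals.items()}
```

The same entry also mentions probability thresholds α and β for selecting novel-specific nouns and POS labels; in code this amounts to a simple filter such as [w for w, p in probs.items() if p > alpha], with the thresholds presumably tuned on ZX-dev as described in the Dataset Splits entry.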