cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Authors: Shaosheng Cao, Wei Lu, Jun Zhou, Xiaolong Li
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE. (A stroke n-gram sketch appears below the table.) |
| Researcher Affiliation | Collaboration | Shaosheng Cao (1,2), Wei Lu (2), Jun Zhou (1), Xiaolong Li (1). Affiliations: (1) AI Department, Ant Financial Services Group; (2) Singapore University of Technology and Design |
| Pseudocode | No | The paper describes the model and objective function but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions and cites several third-party toolkits (e.g., gensim, opencc, ansj, LIBLINEAR, word2vec, GloVe, CWE, GWE, JWE) that were used, but it does not state that the source code for their proposed cw2vec method is publicly available or provide a link to it. |
| Open Datasets | Yes | We downloaded Chinese Wikipedia dump on November 20, 2016, which consists of 265K Chinese Wikipedia articles. (Footnote 7: https://dumps.wikimedia.org/zhwiki/20161120/) We download Fudan Corpus, which contains 9,804 documents in 20 different topics. (Footnote 13: http://www.datatang.com/data/44139/) We used a publicly available dataset fully annotated with named entity labels. (Footnote 15: http://bosonnlp.com/resources/BosonNLP_NER_6C.zip) |
| Dataset Splits | No | For text classification, "70% of total data is used for training and the rest are used for evaluation." For Named Entity Recognition, "70% is randomly selected for training and the remaining 30% is used for evaluation." The paper specifies training and evaluation/test splits but does not explicitly mention a separate validation split or how it was handled. (A split sketch appears below the table.) |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU, CPU model, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions several software toolkits such as "gensim toolkit", "opencc toolkit", "ansj toolkit", and "LIBLINEAR", but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | For a fair comparison between different algorithms, we used the same dimension size for all word embeddings and removed rare words that appeared fewer than 10 times in the training corpus. The window size and the number of negative samples were both set to 5. The embeddings are 300-dimensional (Table 1 caption). Skip-gram and CBOW achieve their best scores when the embedding dimension is set to 200 (Figure 6). (A configuration sketch appears below the table.) |
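
The stroke n-gram feature behind cw2vec (see the Research Type row) maps each stroke of a word's characters to one of five classes (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 turning/others) and slides windows of length 3 to 12 over the concatenated stroke sequence. Below is a minimal sketch of that extraction step; the `STROKES` lookup table is a two-entry illustration, not the paper's actual stroke dictionary, and this is not the authors' implementation.

```python
# Sketch of cw2vec-style stroke n-gram extraction.
# Stroke classes per the paper: 1 horizontal, 2 vertical,
# 3 left-falling, 4 right-falling, 5 turning/others.

# Illustrative two-entry lookup table; a real system needs a full
# character-to-stroke dictionary.
STROKES = {
    "大": [1, 3, 4],
    "人": [3, 4],
}

def stroke_ngrams(word, n_min=3, n_max=12):
    """Concatenate the stroke sequences of a word's characters and
    return every stroke n-gram with n_min <= n <= n_max."""
    seq = [s for ch in word for s in STROKES.get(ch, [])]
    grams = []
    for n in range(n_min, min(n_max, len(seq)) + 1):
        for i in range(len(seq) - n + 1):
            grams.append(tuple(seq[i:i + n]))
    return grams

# "大人" -> strokes [1, 3, 4, 3, 4] -> six n-grams of length 3 to 5
print(stroke_ngrams("大人"))
```

In the full model, each distinct stroke n-gram receives its own embedding, and a word is scored against a context word through its collection of stroke n-gram embeddings rather than a single word vector.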
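
For the Dataset Splits row: the reported 70/30 random split can only be re-created up to the unreported random seed. A minimal sketch; the seed value and the use of integer document IDs are assumptions made here for illustration.

```python
import random

random.seed(42)  # the paper reports no seed; 42 is an arbitrary choice

doc_ids = list(range(9804))  # stand-ins for the 9,804 Fudan Corpus documents
random.shuffle(doc_ids)

cut = int(0.7 * len(doc_ids))
train_ids, test_ids = doc_ids[:cut], doc_ids[cut:]
# 70% training / 30% evaluation; the paper describes no validation split
```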
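
The hyperparameters in the Experiment Setup row map directly onto a word2vec-style training call. A minimal sketch with gensim, which the paper lists among its toolkits; the gensim 4.x API, the skip-gram setting, and the `corpus.txt` path are assumptions, and this trains a baseline, not cw2vec itself.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Placeholder path: one whitespace-tokenized sentence per line,
# e.g. the segmented Chinese Wikipedia dump used in the paper.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # dimension from the Table 1 caption (size= in gensim 3.x)
    window=5,         # context window size from the paper
    negative=5,       # number of negative samples from the paper
    min_count=10,     # drop words appearing fewer than 10 times
    sg=1,             # skip-gram baseline (assumption; CBOW via sg=0)
)
model.save("skipgram_300d.model")
```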