cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information
Authors: Shaosheng Cao, Wei Lu, Jun Zhou, Xiaolong Li
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE. (A stroke n-gram sketch appears below the table.) |
| Researcher Affiliation | Collaboration | Shaosheng Cao (1,2), Wei Lu (2), Jun Zhou (1), Xiaolong Li (1). Affiliations: (1) AI Department, Ant Financial Services Group; (2) Singapore University of Technology and Design |
| Pseudocode | No | The paper describes the model and objective function but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions and cites several third-party toolkits (e.g., gensim, opencc, ansj, LIBLINEAR, word2vec, GloVe, CWE, GWE, JWE) that were used, but it does not state that the source code for their proposed cw2vec method is publicly available or provide a link to it. |
| Open Datasets | Yes | We downloaded Chinese Wikipedia dump on November 20, 2016, which consists of 265K Chinese Wikipedia articles. (Footnote 7: https://dumps.wikimedia.org/zhwiki/20161120/) We download Fudan Corpus, which contains 9,804 documents in 20 different topics. (Footnote 13: http://www.datatang.com/data/44139/) We used a publicly available dataset fully annotated with named entity labels. (Footnote 15: http://bosonnlp.com/resources/BosonNLP_NER_6C.zip) |
| Dataset Splits | No | For text classification, "70% of total data is used for training and the rest are used for evaluation." For Named Entity Recognition, "70% is randomly selected for training and the remaining 30% is used for evaluation." The paper specifies training and evaluation/test splits but does not explicitly mention a separate validation split or how it was handled. (A split sketch appears below the table.) |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU, CPU model, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions several software toolkits such as "gensim toolkit", "opencc toolkit", "ansj toolkit", and "LIBLINEAR", but it does not specify version numbers for any of these components. |
| Experiment Setup | Yes | For a fair comparison between different algorithms, we used the same dimension size for all word embeddings and removed rare words that appeared fewer than 10 times in the training corpus. The window size and the number of negative samples were both set to 5. The embeddings are 300-dimensional (Table 1 caption). Skip-gram and CBOW achieve their best scores when the embedding dimension is set to 200 (Figure 6). (A configuration sketch appears below the table.) |
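
The stroke n-gram feature behind cw2vec (see the Research Type row) maps each stroke of a word's characters to one of five classes (1 horizontal, 2 vertical, 3 left-falling, 4 right-falling, 5 turning/others) and slides windows of length 3 to 12 over the concatenated stroke sequence. Below is a minimal sketch of that extraction step; the `STROKES` lookup table is a two-entry illustration, not the paper's actual stroke dictionary, and this is not the authors' implementation.

```python
# Sketch of cw2vec-style stroke n-gram extraction.
# Stroke classes per the paper: 1 horizontal, 2 vertical,
# 3 left-falling, 4 right-falling, 5 turning/others.

# Illustrative two-entry lookup table; a real system needs a full
# character-to-stroke dictionary.
STROKES = {
    "大": [1, 3, 4],
    "人": [3, 4],
}

def stroke_ngrams(word, n_min=3, n_max=12):
    """Concatenate the stroke sequences of a word's characters and
    return every stroke n-gram with n_min <= n <= n_max."""
    seq = [s for ch in word for s in STROKES.get(ch, [])]
    grams = []
    for n in range(n_min, min(n_max, len(seq)) + 1):
        for i in range(len(seq) - n + 1):
            grams.append(tuple(seq[i:i + n]))
    return grams

# "大人" -> strokes [1, 3, 4, 3, 4] -> six n-grams of length 3 to 5
print(stroke_ngrams("大人"))
```

In the full model, each distinct stroke n-gram receives its own embedding, and a word is scored against a context word through its collection of stroke n-gram embeddings rather than a single word vector.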
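
For the Dataset Splits row: the reported 70/30 random split can only be re-created up to the unreported random seed. A minimal sketch; the seed value and the use of integer document IDs are assumptions made here for illustration.

```python
import random

random.seed(42)  # the paper reports no seed; 42 is an arbitrary choice

doc_ids = list(range(9804))  # stand-ins for the 9,804 Fudan Corpus documents
random.shuffle(doc_ids)

cut = int(0.7 * len(doc_ids))
train_ids, test_ids = doc_ids[:cut], doc_ids[cut:]
# 70% training / 30% evaluation; the paper describes no validation split
```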
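
The hyperparameters in the Experiment Setup row map directly onto a word2vec-style training call. A minimal sketch with gensim, which the paper lists among its toolkits; the gensim 4.x API, the skip-gram setting, and the `corpus.txt` path are assumptions, and this trains a baseline, not cw2vec itself.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Placeholder path: one whitespace-tokenized sentence per line,
# e.g. the segmented Chinese Wikipedia dump used in the paper.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # dimension from the Table 1 caption (size= in gensim 3.x)
    window=5,         # context window size from the paper
    negative=5,       # number of negative samples from the paper
    min_count=10,     # drop words appearing fewer than 10 times
    sg=1,             # skip-gram baseline (assumption; CBOW via sg=0)
)
model.save("skipgram_300d.model")
```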