Learning Context-Specific Word/Character Embeddings

Authors: Xiaoqing Zheng, Jiangtao Feng, Yi Chen, Haoyuan Peng, Wenqing Zhang

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted three sets of experiments. The goal of the first is to test several variants of the single-prototype CSV model to gain some understanding of how the choice of hyper-parameters impacts performance on the word and character similarity tasks, by comparing with well-established single-prototype embedding learning methods. In the second experiment, we compare our multi-sense variant with state-of-the-art multi-prototype word representation models on a data set where each word pair is presented with context. The third is to see how well the learned embeddings enhance supervised learning on four standard NLP tasks (POS tagging and chunking for English; word segmentation and named entity recognition for Chinese), and whether the performance can be further improved by their multi-prototype variants.
Researcher Affiliation | Academia | Xiaoqing Zheng, Jiangtao Feng, Yi Chen, Haoyuan Peng, Wenqing Zhang. School of Computer Science, Fudan University, Shanghai, China; Shanghai Key Laboratory of Intelligent Information Processing. {zhengxq, fengjt16, yi chen15, hypeng15, wqzhang}@fudan.edu.cn
Pseudocode | Yes | The whole training algorithm is shown in Figure 2.

Figure 2: The training algorithm of the CSV model.
Inputs: R: a training corpus. K: a specified number of negative samples. N: a specified number of iterations.
Initialization: the parameters of the network, a global vector g(t) and a sense vector s(t) for each type t ∈ D are initialized with small random values.
Output: the trained network parameters θ, a global vector g(t), and one or more sense vectors s_i(t) for each type t ∈ D, i = 1, 2, ..., k_t.
Algorithm:
for STAGE = 1 to 3 do
    for each context window c in the corpus R:
        compute the context feature vector v(c) by the neural network using equation (1).
        r_c^t = argmax_i sim(s_i(t), v(c)) for the target type t, as in equation (5).
        if (STAGE = 1 or STAGE = 3):
            draw a set of K negative samples neg(t) randomly for the target type t.
            update the network parameters θ, the global vectors of the types in the context window, and the r_c^t-th sense vector of the type t by the gradients with respect to the objective function (4).
        else if (STAGE = 2):
            if (sim(s_{r_c^t}(t), v(c)) < δ):
                create a new sense vector for the type t, initialized by the context vector v(c).
            else:
                update the sense vector s_{r_c^t}(t) to reflect the influence of the current context feature vector v(c).
    until the number of iterations N is reached (N = 1 if STAGE = 2).
end for
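The STAGE-2 branch above (non-parametric sense creation against the threshold δ) can be sketched in Python. The cosine choice of sim(·,·), the running-average update rate, and the vector shapes are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; assumed here as the paper's sim(., .)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def assign_or_create_sense(senses, context_vec, delta=0.15):
    """STAGE-2 step for one target type: pick the sense most similar to the
    context vector; if even the best match falls below delta, spawn a new
    sense initialized from the context vector (Figure 2, STAGE 2 branch)."""
    best_i = max(range(len(senses)), key=lambda i: cosine(senses[i], context_vec))
    if cosine(senses[best_i], context_vec) < delta:
        senses.append(context_vec.copy())   # new sense for this type
        return len(senses) - 1, True
    # nudge the chosen sense toward the current context (assumed update rule)
    senses[best_i] = 0.9 * senses[best_i] + 0.1 * context_vec
    return best_i, False

senses = [np.array([1.0, 0.0])]
idx, created = assign_or_create_sense(senses, np.array([0.0, 1.0]))
# orthogonal context: similarity 0 < delta, so a second sense is created
```

With δ = 0.15 (the value reported in the experiment setup), an orthogonal context vector triggers sense creation, while a near-duplicate context only updates the existing sense.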
Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the methodology, nor does it include links to code repositories.
Open Datasets | Yes | English and Chinese Wikipedia documents (https://www.wikipedia.org/, March 2015 snapshot) were used as the unlabeled corpora by all the compared models to train the embeddings, because of their wide range of topics and usages. For English, a popular data set for evaluating vector-space models is WordSim-353 (Finkelstein et al. 2001), which has 353 pairs of nouns. Hill et al. (2014) presented SimLex-999, which explicitly quantifies similarity rather than relatedness or association, so that pairs that are related but not actually similar are rated with low scores. For the Chinese character similarity task, to the best of our knowledge, no such data set is available. Following the guidance of (Hill, Reichart, and Korhonen 2014), we constructed a CharSim-200 data set, which contains two hundred Chinese character pairs. We picked the Penn Chinese Treebank from Bakeoff-3 as our data set (Levow 2006). For the NER task, we chose the MSRA data set from Bakeoff-3 with standard train-test splits (Levow 2006).
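For context, the standard protocol on similarity sets like WordSim-353, SimLex-999, and CharSim-200 is to score each pair by the cosine similarity of its embeddings and report Spearman's rank correlation with the human ratings. A minimal sketch of that protocol, assuming scipy is available (the toy vectors and pairs are illustrative, not data from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, pairs):
    """embeddings: dict word -> vector; pairs: list of (w1, w2, human_score).
    Returns Spearman's rho between model cosine scores and human ratings."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        v1, v2 = embeddings[w1], embeddings[w2]
        cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        model_scores.append(cos)
        human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# toy example: rankings agree perfectly, so rho = 1.0
emb = {"a": np.array([1.0, 0.0]),
       "b": np.array([0.9, 0.1]),
       "c": np.array([0.0, 1.0])}
pairs = [("a", "b", 9.0), ("a", "c", 1.0), ("b", "c", 2.0)]
```

Only the ranking matters to Spearman's ρ, which is why it is preferred over Pearson correlation for these benchmarks: human ratings and cosine scores live on different scales.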
Dataset Splits | No | The paper mentions 'standard train-test splits' for the MSRA data set and the use of a 'validation set' for hyper-parameter tuning ('The values of hyperparameters were chosen by a small amount of manual exploration on a validation set.'), but it does not provide specific percentages, sample counts, or a detailed splitting methodology for the training, validation, and test sets of any of the data sets used.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU/CPU models, processor types, or memory amounts, used to run its experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or frameworks) used in the experiments.
Experiment Setup | Yes | The values of hyperparameters were chosen by a small amount of manual exploration on a validation set. The network was trained by setting the window size to 11, the dimension to 300, and δ to 0.15. All the results reported have been averaged over five runs.