Refining Word Representations by Manifold Learning

Authors: Chu Yonghe, Hongfei Lin, Liang Yang, Yufeng Diao, Shaowu Zhang, Fan Xiaochao

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our word representations have been evaluated on a variety of lexical-level intrinsic tasks (semantic relatedness, semantic similarity), and the experimental results show that the proposed model outperforms several popular word representation approaches.
Researcher Affiliation | Academia | Chu Yonghe, Hongfei Lin*, Liang Yang, Yufeng Diao, Shaowu Zhang and Fan Xiaochao, Dalian University of Technology. yhchu@mail.dlut.edu.cn, hflin@dlut.edu.cn, liang@dlut.edu.cn, diaoyufeng@mail.dlut.edu.cn, zhangsw@dlut.edu.cn, fxc1982@mail.dlut.edu.cn
Pseudocode | Yes | Algorithm 1: Refining Word Representations by Manifold Learning. Input: word vectors. 1: Select a window in all word vectors as the data sample for manifold learning. 2: Use the data samples obtained in Step 1 to train the MLLE algorithm according to Eq. (1) and (6): X = {x_1, x_2, ..., x_N} --fit--> MLLE. 3: Apply the trained MLLE model to the test words by re-embedding them according to Eq. (7) and (8): x(w_test) -> y(w_test) (the word vector dimensions remain unchanged). Output: processed representations y(w_test).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We use word vectors trained by the GloVe model as the original input, along with three corpora: the Common Crawl corpus consisting of 840B tokens and a vocabulary of 2.2M words (300-dimensional vectors); Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, 50d, 100d, 200d, and 300d vectors); and Common Crawl (42B tokens, 1.9M vocab, 300d vectors), in line with the approach adopted by Pennington et al. [2014b]. We evaluate the proposed method and the baselines of Section 4.2 on two tasks, namely semantic relatedness and semantic similarity. Semantic-relatedness datasets include MEN [Bruni et al., 2014], where 3,000 word pairs are rated by crowdsourced participants; Wordrel-252 (WORDREL) [Agirre et al., 2009]; and MTurk [Radinsky et al., 2011], where 287 word pairs are rated for relatedness. Semantic-similarity datasets include the first published RG65 dataset [Rubenstein et al., 1965]; the widely used WordSim-353 (WS353) dataset [Finkelstein et al., 2001], which contains 353 pairs of commonly used verbs and nouns; the SimLex-999 (SIMLEX) dataset [Hill and Korhonen, 2015], whose scores measure genuine similarity; the SimVerb-3500 (SIMVERB) dataset [Gerz et al., 2016]; and WordSim-203 (WS203) [Gerz et al., 2016].
Dataset Splits | No | The paper uses pre-trained word vectors and evaluates on established benchmark datasets (e.g., WS353, RG65), but it does not provide explicit train/validation/test splits for these evaluation datasets or for the word vectors used in the manifold learning process, beyond selecting 'a subset of samples' or 'training word window sizes'.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper states 'We use the scikit-learn toolkit for the experiments' but does not provide specific version numbers for this or any other software dependency.
Experiment Setup | Yes | The size of the training word window was set as [1001, 1501, 2001]. The value range of the MLLE algorithm neighborhood is [300, 1000]. All models are trained in triplicate and the average results are reported in Table 1 and Table 2. (Window start [2000, 19001], number of MLLE local neighbors [1001, 2001], window length [300, 1001], manifold dimensionality = space dimensionality.)
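The pseudocode in Algorithm 1 maps directly onto scikit-learn, the toolkit the paper says it used: MLLE is `LocallyLinearEmbedding` with `method="modified"`. The sketch below is a minimal illustration of that fit-then-re-embed pipeline; the array names and the tiny sizes (200 words, 10 dimensions) are placeholders chosen so the example runs quickly, not values from the paper.

```python
# Sketch of Algorithm 1 (fit MLLE on a window, re-embed test words)
# using scikit-learn's Modified LLE; sizes are illustrative only.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)
window = rng.normal(size=(200, 10))    # stand-in for a window of pre-trained (e.g. GloVe) vectors
test_words = rng.normal(size=(5, 10))  # stand-in for held-out word vectors to refine

# The paper keeps the manifold dimensionality equal to the input
# dimensionality; "modified" LLE requires n_neighbors > n_components.
mlle = LocallyLinearEmbedding(n_neighbors=15, n_components=10, method="modified")
mlle.fit(window)                       # Step 2: train MLLE on the window samples
refined = mlle.transform(test_words)   # Step 3: re-embed test words, x(w_test) -> y(w_test)
print(refined.shape)                   # (5, 10) -- dimensions unchanged
```

Note the design point the algorithm relies on: `transform` embeds new points via their reconstruction weights in the fitted neighborhood graph, which is what lets the trained MLLE model be applied to words outside the training window.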