Refining Word Representations by Manifold Learning
Authors: Chu Yonghe, Hongfei Lin, Liang Yang, Yufeng Diao, Shaowu Zhang, Fan Xiaochao
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our word representations have been evaluated on a variety of lexical-level intrinsic tasks (semantic relatedness, semantic similarity) and the experimental results show that the proposed model outperforms several popular word representations approaches. |
| Researcher Affiliation | Academia | Chu Yonghe, Hongfei Lin*, Liang Yang, Yufeng Diao, Shaowu Zhang and Fan Xiaochao Dalian University of Technology yhchu@mail.dlut.edu.cn, hflin@dlut.edu.cn, liang@dlut.edu.cn, diaoyufeng@mail.dlut.edu.cn, zhangsw@dlut.edu.cn, fxc1982@mail.dlut.edu.cn |
| Pseudocode | Yes | Algorithm 1 Refining Word Representations by Manifold Learning. Input: pre-trained word vectors. 1: Select a window in all word vectors as the data sample for manifold learning. 2: Use the data samples from Step 1 to train the MLLE algorithm according to Eq. (1) and (6): X = {x_1, x_2, ..., x_N} → fit MLLE. 3: Apply the trained MLLE model to the test words, re-embedding them according to Eq. (7) and (8): x(w_test) → y(w_test) (the word vector dimensions remain unchanged). Output: processed representations y(w_test). |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | We use word vectors trained by the GloVe model as the original input, along with three corpora: the Common Crawl corpus consisting of 840B tokens and a vocabulary of 2.2M words (300-dimensional); Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, 50d, 100d, 200d, and 300d vectors); and Common Crawl (42B tokens, 1.9M vocab, 300d vectors), in line with the approach adopted by Pennington et al. [2014b]. We experiment with the method proposed in this paper and the baselines of Section 4.2 on two tasks, namely semantic relatedness and semantic similarity. Semantic relatedness tasks include the MEN dataset [Bruni et al., 2014], where 3,000 pairs of words are rated by crowdsourced participants; Wordrel-252 (WORDREL) [Agirre et al., 2009]; and the MTurk dataset [Radinsky et al., 2011], where 287 pairs of words are rated in terms of relatedness. Semantic similarity tasks include the first published RG65 dataset [Rubenstein et al., 1965]; the widely used WordSim-353 (WS353) dataset [Finkelstein et al., 2001], which contains 353 pairs of commonly used verbs and nouns; the SimLex-999 (SIMLEX) dataset [Hill and Korhonen, 2015], where the score measures genuine similarity; the SimVerb-3500 (SIMVERB) dataset [Gerz et al., 2016]; and Wordsim-203 (WS203) [Gerz et al., 2016]. |
| Dataset Splits | No | The paper uses pre-trained word vectors and evaluates on established benchmark datasets (e.g., WS353, RG65). However, it does not provide explicit train/validation/test splits for these evaluation datasets or for the word vectors used in the manifold learning process beyond selecting 'a subset of samples' or 'training word window sizes'. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper states 'We use the scikit-learn toolkit for the experiments.' but does not provide specific version numbers for this or any other software dependency. |
| Experiment Setup | Yes | The size of the training word window was set as [1001, 1501, 2001]. The value range of the MLLE algorithm neighborhood is [300, 1000]. All models are trained in triplicate and the average results are reported in Table 1 and Table 2. (Window start [2000, 19001], number of MLLE local neighbors [1001, 2001], window length [300, 1001], manifold dimensionality = space dimensionality.) |
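Since the paper reports using the scikit-learn toolkit but releases no code, the pipeline in Algorithm 1 can be sketched as follows. This is a minimal reconstruction, not the authors' implementation: the vectors are random stand-ins for pre-trained GloVe embeddings, and the window size, neighborhood count, and dimensionality are scaled down from the paper's settings so the example runs quickly.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pre-trained word vectors:
# a "window" of 200 vectors for fitting and 5 held-out test-word
# vectors, all 10-dimensional (the paper uses 50d-300d GloVe vectors,
# window sizes around 1001-2001, and neighborhoods in [300, 1000]).
window_vectors = rng.normal(size=(200, 10))
test_vectors = rng.normal(size=(5, 10))

# Step 2: train Modified LLE (MLLE) on the window sample.
# n_components equals the input dimensionality, matching the paper's
# "manifold dimensionality = space dimensionality" setting, so the
# word vector dimensions remain unchanged.
mlle = LocallyLinearEmbedding(
    n_neighbors=25,
    n_components=10,
    method="modified",
)
mlle.fit(window_vectors)

# Step 3: re-embed the test words with the trained model.
refined = mlle.transform(test_vectors)
print(refined.shape)  # (5, 10) -- same dimensionality as the input
```

Note that scikit-learn's modified LLE requires `n_neighbors >= n_components`, which the paper's reported ranges (neighborhoods of 300-1000 for 50-300 dimensional vectors) satisfy.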