FRAGE: Frequency-Agnostic Word Representation
Authors: Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation, and text classification. Results show that with FRAGE, we achieve higher performance than the baselines in all tasks. |
| Researcher Affiliation | Collaboration | (1) Peking University; (2) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; (3) Microsoft Research Asia; (4) Center for Data Science, Peking University, Beijing Institute of Big Data Research |
| Pseudocode | Yes | Algorithm 1 Proposed Algorithm |
| Open Source Code | Yes | Code for our implementation is available at https://github.com/ChengyueGongR/Frequency-Agnostic |
| Open Datasets | Yes | We use the skip-gram model as our baseline model [28], and train the embeddings using Enwik9 (http://mattmahoney.net/dc/textdata.html). ... We do experiments on two widely used datasets [25, 26, 41], Penn Treebank (PTB) [27] and WikiText-2 (WT2) [26]. |
| Dataset Splits | Yes | Table 2: Perplexity on validation and test sets on Penn Treebank and WikiText-2. ... For fair comparisons, for each task, our method shares the same model architecture as the baseline. The only difference is that we use the original task-specific loss function with an additional adversarial loss as in Eqn. (3). Dataset description and hyper-parameter configurations can be found in [12]. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper mentions software like 'word2vec', 'Transformer', 'AWD-LSTM', and 'AWD-LSTM-MoS' but does not provide specific version numbers for these or other ancillary software components such as programming languages or libraries. |
| Experiment Setup | Yes | In all tasks, we simply set the top 20% frequent words in vocabulary as popular words and denote the rest as rare words... For all the tasks except training skip-gram model, we use full-batch gradient descent to update the discriminator. For training skip-gram model, mini-batch stochastic gradient descent is used to update the discriminator with a batch size 3000... For language modeling and machine translation tasks, we use logistic regression as the discriminator. For other tasks, we find that using a shallow neural network with one hidden layer is more efficient, and we set the number of nodes in the hidden layer to 1.5 times the embedding size. In all tasks, we set the hyper-parameter λ to 0.1. (A hedged sketch of this setup appears below the table.) |
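
As a reading aid for the Experiment Setup row, the following is a minimal PyTorch-style sketch of the adversarial scheme described above: the top 20% most frequent words are marked as popular, a logistic-regression discriminator tries to separate popular from rare word embeddings, and the embeddings and task model are trained against the task loss plus an adversarial term weighted by λ = 0.1, as in Eqn. (3). The toy task head, optimizer settings, and tensor shapes are illustrative assumptions, not the authors' implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch of FRAGE's adversarial setup as summarized in the table.
# Assumed: word ids sorted by frequency, a toy classification head as the
# "task", and plain SGD. None of this is taken from the authors' code.
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 256
lambda_adv = 0.1                             # hyper-parameter λ from the paper

# Top 20% most frequent word ids are treated as "popular", the rest as "rare".
popular_boundary = int(0.2 * vocab_size)
is_popular = (torch.arange(vocab_size) < popular_boundary).float()

embedding = nn.Embedding(vocab_size, emb_dim)
discriminator = nn.Linear(emb_dim, 1)        # logistic regression over embeddings
task_model = nn.Linear(emb_dim, vocab_size)  # stand-in for the real task-specific model

opt_model = torch.optim.SGD(
    list(embedding.parameters()) + list(task_model.parameters()), lr=0.1)
opt_disc = torch.optim.SGD(discriminator.parameters(), lr=0.1)
bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()

def train_step(word_ids, targets):
    """One alternating update: discriminator first, then embeddings + task model."""
    emb = embedding(word_ids)
    labels = is_popular[word_ids]

    # (1) Discriminator step: classify popular vs. rare from the (detached) embeddings.
    d_loss = bce(discriminator(emb.detach()).squeeze(-1), labels)
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()

    # (2) Model step: task loss minus λ times the discriminator loss, so the
    #     embeddings are pushed to fool the discriminator (frequency-agnostic).
    task_loss = ce(task_model(emb), targets)
    adv_loss = bce(discriminator(emb).squeeze(-1), labels)
    total = task_loss - lambda_adv * adv_loss
    opt_model.zero_grad()
    total.backward()
    opt_model.step()
    return d_loss.item(), total.item()

# Toy usage with a random batch of word ids and next-word targets.
ids = torch.randint(0, vocab_size, (32,))
tgt = torch.randint(0, vocab_size, (32,))
print(train_step(ids, tgt))
```

The alternating schedule shown here is only one plausible reading of Algorithm 1; per the paper, the discriminator is updated with full-batch gradient descent in most tasks and with mini-batches of size 3000 when training the skip-gram model.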