FRAGE: Frequency-Agnostic Word Representation

Authors: Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation, and text classification. Results show that with FRAGE, we achieve higher performance than the baselines in all tasks."
Researcher Affiliation | Collaboration | "(1) Peking University; (2) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; (3) Microsoft Research Asia; (4) Center for Data Science, Peking University, Beijing Institute of Big Data Research"
Pseudocode | Yes | "Algorithm 1: Proposed Algorithm" (an illustrative sketch of this adversarial training procedure follows the table)
Open Source Code | Yes | "Code for our implementation is available at https://github.com/ChengyueGongR/Frequency-Agnostic"
Open Datasets | Yes | "We use the skip-gram model as our baseline model [28], and train the embeddings using Enwik9 (http://mattmahoney.net/dc/textdata.html). ... We do experiments on two widely used datasets [25, 26, 41], Penn Treebank (PTB) [27] and WikiText-2 (WT2) [26]."
Dataset Splits | Yes | "Table 2: Perplexity on validation and test sets on Penn Treebank and WikiText-2. ... For fair comparisons, for each task, our method shares the same model architecture as the baseline. The only difference is that we use the original task-specific loss function with an additional adversarial loss as in Eqn. (3). Dataset description and hyper-parameter configurations can be found in [12]."
Hardware Specification | No | The paper does not specify the hardware used for running the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper mentions software such as word2vec, Transformer, AWD-LSTM, and AWD-LSTM-MoS, but does not provide version numbers for these components or for the underlying programming languages and libraries.
Experiment Setup | Yes | "In all tasks, we simply set the top 20% frequent words in vocabulary as popular words and denote the rest as rare words... For all the tasks except training skip-gram model, we use full-batch gradient descent to update the discriminator. For training skip-gram model, mini-batch stochastic gradient descent is used to update the discriminator with a batch size 3000... For language modeling and machine translation tasks, we use logistic regression as the discriminator. For other tasks, we find using a shallow neural network with one hidden layer is more efficient and we set the number of nodes in the hidden layer as 1.5 times embedding size. In all tasks, we set the hyper-parameter λ to 0.1."
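
The pseudocode and experiment-setup rows describe FRAGE's adversarial objective: a discriminator is trained to separate embeddings of popular words (the top 20% by frequency) from rare-word embeddings, while the task model is trained with its original loss plus a λ-weighted adversarial term (λ = 0.1) that pushes the two groups toward being indistinguishable. The sketch below is a minimal PyTorch illustration of that setup, not the authors' released code: the names (FrageDiscriminator, frage_losses, task_loss, and so on) are invented for illustration, and it uses a non-saturating "fool the discriminator" term as a stand-in for the exact min-max formulation of Eqn. (3).

    # Minimal sketch of FRAGE-style adversarial training (assumed names,
    # simplified schedule; not the authors' released implementation).
    import torch
    import torch.nn as nn

    LAMBDA_ADV = 0.1  # hyper-parameter lambda reported in the paper

    class FrageDiscriminator(nn.Module):
        # Shallow discriminator: one hidden layer with ~1.5x the embedding
        # size, as described for tasks other than language modeling and
        # machine translation (those use plain logistic regression instead).
        def __init__(self, embed_dim):
            super().__init__()
            hidden = int(1.5 * embed_dim)
            self.net = nn.Sequential(
                nn.Linear(embed_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 1),  # single logit: popular (1) vs. rare (0)
            )

        def forward(self, emb):
            return self.net(emb).squeeze(-1)

    def frage_losses(embedding, popular_ids, rare_ids, discriminator, task_loss):
        # popular_ids / rare_ids: LongTensors holding the vocabulary indices of
        # the top-20%-frequent words and of the remaining words, respectively.
        # Returns (model_loss, d_loss): the task model minimizes model_loss,
        # the discriminator minimizes d_loss, updated alternately.
        bce = nn.BCEWithLogitsLoss()

        pop_emb = embedding(popular_ids)   # embeddings of popular words
        rare_emb = embedding(rare_ids)     # embeddings of rare words

        # Discriminator objective: tell popular from rare embeddings.
        d_loss = (bce(discriminator(pop_emb.detach()),
                      torch.ones(popular_ids.shape[0]))
                  + bce(discriminator(rare_emb.detach()),
                        torch.zeros(rare_ids.shape[0])))

        # Adversarial term for the embeddings: make rare words look "popular".
        adv_loss = bce(discriminator(rare_emb), torch.ones(rare_ids.shape[0]))

        model_loss = task_loss + LAMBDA_ADV * adv_loss
        return model_loss, d_loss

Per the experiment-setup row, the discriminator in such a loop would be updated with full-batch gradient descent for most tasks and with mini-batch SGD (batch size 3000) when training the skip-gram model, alternating with the updates of the task model and its embeddings.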