Slim Embedding Layers for Recurrent Neural Language Models
Authors: Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, Shuang Wu
AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on several data sets show that the new method can get similar perplexity and BLEU score results while only using a very tiny fraction of parameters. |
| Researcher Affiliation | Collaboration | Zhongliang Li, Raymond Kulhanek (Wright State University, {li.141, kulhanek.5}@wright.edu); Shaojun Wang (SVAIL, Baidu Research, swang.usa@gmail.com); Yunxin Zhao (University of Missouri, zhaoy@missouri.edu); Shuang Wu (Yitu Inc., shuang.wu@gmail.com) |
| Pseudocode | Yes | The inference algorithm is listed in Algorithm 1. Algorithm 1 (Inference): 1) divide the hidden vector h into K even parts; 2) evaluate the partial dot products for each (hidden state sub-vector, embedding) pair and cache the results; 3) sum the result for each word according to the sub-vector mapping table. (A hedged sketch of this procedure follows the table.) |
| Open Source Code | No | The paper states 'the code is based on the code open sourced from Kim et al. (2016)' but does not explicitly state that the authors' own code for the described methodology is open-source or provide a link to it. |
| Open Datasets | Yes | We test our method of compressing the embedding layers on various publicly available standard language model data sets ranging from the smallest corpus, PTB (Marcus, Marcinkiewicz, and Santorini 1993), to the largest, Google's Billion Word corpus (Chelba et al. 2013). |
| Dataset Splits | Yes | Both the language and translation models were trained using the WMT12 data (Callison-Burch et al. 2012), with the Europarl v7 corpus for training, newstest2010 for validation, and newstest2011 for test, all lowercased. |
| Hardware Specification | Yes | Both experiments using NCE take about seven days of training on a GTX 1080 GPU. |
| Software Dependencies | No | The paper mentions using 'Torch' to implement the models but does not provide specific version numbers for software dependencies. |
| Experiment Setup | Yes | The weights are initialized with uniform random values between -0.05 and 0.05. Mini-batch stochastic gradient descent (SGD) is used to train the models. We tune the parameters with the Adagrad (Duchi, Hazan, and Singer 2011) algorithm. (A hedged configuration sketch follows the table.) |
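
Below is a minimal NumPy sketch of the three steps quoted in the Pseudocode row (Algorithm 1), assuming a shared pool of embedding sub-vectors and a per-word sub-vector mapping table. The names (`sub_vectors`, `mapping`), sizes, and random data are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

# Hedged sketch of Algorithm 1 (inference at the output layer); all sizes
# and names are illustrative assumptions, not the authors' code.
K, H, V, P = 4, 8, 10, 6          # parts, hidden size, vocab size, pool size
rng = np.random.default_rng(0)

sub_vectors = rng.normal(size=(P, H // K))   # shared pool of embedding sub-vectors
mapping = rng.integers(0, P, size=(V, K))    # mapping[w, k]: pool index for word w, part k
h = rng.normal(size=H)                       # hidden state from the RNN

# 1) Divide the hidden vector h into K even parts.
h_parts = h.reshape(K, H // K)

# 2) Evaluate the partial dot products for each
#    (hidden-state sub-vector, embedding sub-vector) pair and cache them.
partial = h_parts @ sub_vectors.T            # shape (K, P)

# 3) Sum the cached results for each word according to the mapping table.
logits = partial[np.arange(K)[:, None], mapping.T].sum(axis=0)   # shape (V,)

# Sanity check against the naive product with full, concatenated embeddings.
full_embeddings = sub_vectors[mapping].reshape(V, H)
assert np.allclose(logits, full_embeddings @ h)
```

In this sketch the cached `partial` table has only K×P entries, so each word's logit costs K lookups and additions rather than a full H-dimensional dot product, which is the point of caching the partial dot products in Algorithm 1.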
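For the Experiment Setup row, here is a hedged PyTorch-style sketch of the quoted configuration (uniform initialization in [-0.05, 0.05], mini-batch SGD, Adagrad tuning). The model, learning rates, and sizes are placeholder assumptions, not the values used in the paper.

```python
import torch
import torch.nn as nn

# Hedged sketch of the reported setup; the model and hyperparameters
# below are placeholders, not the paper's values.
model = nn.LSTM(input_size=256, hidden_size=256, num_layers=2)

# Weights initialized with uniform random values between -0.05 and 0.05.
for p in model.parameters():
    nn.init.uniform_(p, -0.05, 0.05)

# Mini-batch SGD for training the models ...
sgd = torch.optim.SGD(model.parameters(), lr=1.0)
# ... and Adagrad (Duchi, Hazan, and Singer 2011) for parameter tuning.
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.1)
```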