Slim Embedding Layers for Recurrent Neural Language Models

Authors: Zhongliang Li, Raymond Kulhanek, Shaojun Wang, Yunxin Zhao, Shuang Wu

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on several data sets show that the new method can get similar perplexity and BLEU score results while only using a very tiny fraction of parameters.
Researcher Affiliation | Collaboration | Zhongliang Li and Raymond Kulhanek, Wright State University ({li.141, kulhanek.5}@wright.edu); Shaojun Wang, SVAIL, Baidu Research (swang.usa@gmail.com); Yunxin Zhao, University of Missouri (zhaoy@missouri.edu); Shuang Wu, Yitu Inc. (shuang.wu@gmail.com)
Pseudocode | Yes | The inference algorithm is listed in Algorithm 1: (1) divide the hidden vector h into K even parts; (2) evaluate the partial dot products for each (hidden-state sub-vector, embedding) pair and cache the results; (3) sum the results for each word according to the sub-vector mapping table. A minimal sketch of this procedure appears after the table.
Open Source Code | No | The paper states 'the code is based on the code open sourced from Kim et al. (2016)' but does not explicitly state that the authors' own code for the described methodology is open source or provide a link to it.
Open Datasets | Yes | We test our method of compressing the embedding layers on various publicly available standard language model data sets ranging from the smallest corpus, PTB (Marcus, Marcinkiewicz, and Santorini 1993), to the largest, Google's BillionW corpus (Chelba et al. 2013).
Dataset Splits | Yes | Both the language and translation models were trained using the WMT12 data (Callison-Burch et al. 2012), with the Europarl v7 corpus for training, newstest2010 for validation, and newstest2011 for test, all lowercased.
Hardware Specification | Yes | Both experiments using NCE take about seven days of training on a GTX 1080 GPU.
Software Dependencies | No | The paper mentions using 'Torch' to implement the models but does not provide specific version numbers for software dependencies.
Experiment Setup | Yes | The weights are initialized with uniform random values between -0.05 and 0.05. Mini-batch stochastic gradient descent (SGD) is used to train the models; the parameters are tuned with the Adagrad (Duchi, Hazan, and Singer 2011) algorithm. A minimal sketch of this setup appears after the table.
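
The Algorithm 1 steps quoted in the Pseudocode row describe inference for the compressed embedding layer. The following is a minimal NumPy sketch of that procedure under assumed shapes; the function name, the `mapping` table layout, and the pool sizes are illustrative and are not taken from the authors' code.

import numpy as np

def slim_output_logits(h, sub_embeddings, mapping, K):
    """Sketch of Algorithm 1 (inference).

    h:              hidden state, shape (d,), with d divisible by K
    sub_embeddings: list of K arrays, each of shape (M_k, d // K); the shared sub-vector pools
    mapping:        int array of shape (V, K); mapping[w, k] selects word w's sub-vector in part k
    Returns logits of shape (V,).
    """
    # Step 1: divide the hidden vector h into K even parts
    parts = np.split(h, K)
    # Step 2: evaluate the partial dot products for each (hidden sub-vector, sub-embedding) pair and cache them
    partial = [sub_embeddings[k] @ parts[k] for k in range(K)]  # each entry has shape (M_k,)
    # Step 3: sum the cached partial products for each word according to the sub-vector mapping table
    V = mapping.shape[0]
    logits = np.zeros(V)
    for k in range(K):
        logits += partial[k][mapping[:, k]]
    return logits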
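
The Experiment Setup row reports uniform initialization in [-0.05, 0.05], mini-batch SGD, and Adagrad tuning. Below is a minimal sketch of that configuration, assuming PyTorch as a stand-in for the paper's Torch implementation; the model and the learning rates are placeholders, not values from the paper.

import torch

# Placeholder model; the paper's actual LSTM language models are not reproduced here.
model = torch.nn.LSTM(input_size=256, hidden_size=512)

# Weights initialized with uniform random values between -0.05 and 0.05.
for p in model.parameters():
    torch.nn.init.uniform_(p, -0.05, 0.05)

# Mini-batch SGD is used for training; the learning rate is illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
# For the runs that tune parameters with Adagrad, an optimizer such as the
# following could be used instead:
# optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)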