Towards Binary-Valued Gates for Robust LSTM Training

Authors: Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical studies show that (1) Although it seems that we restrict the model capacity, there is no performance drop: we achieve better or comparable performances due to its better generalization ability; (2) The outputs of gates are not sensitive to their inputs: we can easily compress the LSTM unit in multiple ways, e.g., low-rank approximation and low-precision approximation. The compressed models are even better than the baseline models without compression. (See the compression sketch after this table.)
Researcher Affiliation | Collaboration | (1) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University; (2) Microsoft Research; (3) Center for Data Science, Peking University, Beijing Institute of Big Data Research.
Pseudocode | No | The paper provides mathematical equations for LSTM and G2-LSTM but does not include any structured pseudocode or algorithm blocks. (A hedged sketch of the Gumbel-sigmoid gate behind G2-LSTM is given after this table.)
Open Source Code | Yes | Codes for the experiments are available at https://github.com/zhuohan123/g2-lstm
Open Datasets | Yes | We used the Penn Treebank corpus that contains about 1 million words. We used two datasets for experiments on neural machine translation (NMT): (1) the IWSLT'14 German-English translation dataset (Cettolo et al., 2014)... (2) the English-German translation dataset in WMT'14...
Dataset Splits | Yes | The training/validation/test sets contain about 153K/7K/7K sentence pairs respectively, with words pre-processed into sub-word units using byte pair encoding (BPE) (Sennrich et al., 2016). We chose the 25K most frequent sub-word units as the vocabulary for both German and English. (2) The English-German translation dataset in WMT'14... The training set contains 4.5M English-German sentence pairs, Newstest2014 is used as the test set, and the concatenation of Newstest2012 and Newstest2013 is used as the validation set.
Hardware Specification | Yes | All models were trained with Adadelta (Zeiler, 2012) on one M40 GPU.
Software Dependencies | No | The paper mentions the Adadelta optimizer and BPE pre-processing, and states that its code builds on prior work, but it does not give version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We followed the practice in (Merity et al., 2017) to set up the model architecture for LSTM: a stacked three-layer LSTM with drop-connect (Wan et al., 2013) on recurrent weights and a variant of averaged stochastic gradient descent (ASGD) (Polyak & Juditsky, 1992) for optimization, with a 500-epoch training phase and a 500-epoch finetune phase. ...temperature τ in G2-LSTM... we set it to 0.9... set the size of word embedding and hidden state to 256. ...set the size of word embedding and hidden state to 512 and 1024 respectively. ...The mini-batch size was 32/64 for German-English/English-German respectively. ...Both gradient clipping norms were set to 2.0. We used tokenized case-insensitive and case-sensitive BLEU as the evaluation measure for German-English/English-German respectively... The beam size is set to 5 during the inference step. (These settings are collected into a configuration sketch after this table.)
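
Since the paper gives the G2-LSTM gates as equations rather than pseudocode (see the Pseudocode row), the following is a minimal PyTorch sketch of a Gumbel-sigmoid gate of the kind the method builds on. The function name gumbel_sigmoid, the logistic-noise construction, and the exact placement of the temperature are illustrative assumptions, not the authors' code; only the temperature value 0.9 is taken from the quoted setup.

import torch

def gumbel_sigmoid(logits: torch.Tensor, tau: float = 0.9, training: bool = True) -> torch.Tensor:
    # Sketch of a Gumbel-sigmoid (binary-concrete) gate: during training,
    # logistic noise (the difference of two Gumbel samples) is added to the
    # gate pre-activation before a temperature-scaled sigmoid, which pushes
    # the output toward {0, 1} while keeping it differentiable.
    if training:
        u = torch.rand_like(logits).clamp(1e-6, 1.0 - 1e-6)
        logits = logits + torch.log(u) - torch.log1p(-u)  # logistic noise
    return torch.sigmoid(logits / tau)

In an LSTM cell, such a function would stand in for the plain sigmoid on the input- and forget-gate pre-activations; at inference time the noise term is dropped and only the sharpened sigmoid remains.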
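
The Research Type row quotes the finding that trained gates tolerate low-rank and low-precision approximation. The sketch below shows what those two operations could look like on a single gate weight matrix; the truncated-SVD rank, the uniform rounding grid, and the slicing of PyTorch's stacked gate weights are illustrative assumptions rather than the paper's exact compression procedure.

import torch

def low_rank_approx(weight: torch.Tensor, rank: int) -> torch.Tensor:
    # Best rank-`rank` approximation of a gate weight matrix via truncated SVD.
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    return u[:, :rank] @ torch.diag(s[:rank]) @ vh[:rank, :]

def low_precision_approx(weight: torch.Tensor, num_bits: int = 2) -> torch.Tensor:
    # Round a gate weight matrix onto a small uniform grid as a simple
    # low-precision stand-in (2 bits -> 4 levels).
    levels = 2 ** num_bits - 1
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min) / levels
    return torch.round((weight - w_min) / scale) * scale + w_min

# Hypothetical usage: compress the input- and forget-gate blocks of a trained
# cell in place (PyTorch stacks the four gates along dim 0 in i, f, g, o order).
cell = torch.nn.LSTMCell(input_size=256, hidden_size=256)
with torch.no_grad():
    h = cell.hidden_size
    cell.weight_ih[:h] = low_rank_approx(cell.weight_ih[:h], rank=64)
    cell.weight_ih[h:2 * h] = low_precision_approx(cell.weight_ih[h:2 * h])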
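
Finally, the language-model and NMT settings quoted in the Experiment Setup row can be gathered into one place. The dictionary layout and key names below are assumptions made for readability; the values are the ones quoted above (τ = 0.9 is listed with every task because the quote does not tie it to a specific one), and the Adadelta/M40 detail comes from the Hardware Specification row.

# Hedged summary of the quoted training settings; key names are illustrative.
EXPERIMENT_SETUP = {
    "ptb_language_model": {
        "architecture": "stacked three-layer LSTM with drop-connect on recurrent weights",
        "optimizer": "ASGD variant (following Merity et al., 2017)",
        "train_epochs": 500,
        "finetune_epochs": 500,
        "gate_temperature_tau": 0.9,
    },
    "iwslt14_de_en": {
        "embedding_size": 256,
        "hidden_size": 256,
        "batch_size": 32,
        "gradient_clip_norm": 2.0,
        "gate_temperature_tau": 0.9,
        "optimizer": "Adadelta (one M40 GPU)",
        "beam_size": 5,
        "bleu": "tokenized, case-insensitive",
    },
    "wmt14_en_de": {
        "embedding_size": 512,
        "hidden_size": 1024,
        "batch_size": 64,
        "gradient_clip_norm": 2.0,
        "gate_temperature_tau": 0.9,
        "optimizer": "Adadelta (one M40 GPU)",
        "beam_size": 5,
        "bleu": "tokenized, case-sensitive",
    },
}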