HyperNetworks

Authors: David Ha, Andrew M. Dai, Quoc V. Le

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform experiments to investigate the behaviors of hypernetworks in a range of contexts and find that hypernetworks mix well with other techniques such as batch normalization and layer normalization. Our main result is that hypernetworks can generate non-shared weights for LSTM that work better than the standard version of LSTM (Hochreiter & Schmidhuber, 1997). On language modelling tasks with the character-level Penn Treebank and Hutter Prize Wikipedia datasets, hypernetworks for LSTM achieve near state-of-the-art results. On a handwriting generation task with the IAM handwriting dataset, hypernetworks for LSTM achieve good quantitative and qualitative results. On machine translation, hypernetworks for LSTM also obtain state-of-the-art performance on the WMT'14 En→Fr benchmark.
Researcher Affiliation | Industry | David Ha, Andrew M. Dai, Quoc V. Le, Google Brain, {hadavid,adai,qvl}@google.com
Pseudocode | No | The paper describes the model using mathematical equations and textual explanations, but does not include explicit pseudocode or algorithm blocks. (An illustrative sketch of the weight-generation idea is given after this table.)
Open Source Code | No | For a more detailed interactive demonstration of handwriting generation using Hyper LSTM, visit http://blog.otoro.net/2016/09/28/hyper-networks/. This is a demonstration, not an explicit statement of or link to open-source code for the methodology.
Open Datasets | Yes | We first evaluate the Hyper LSTM model on a character-level prediction task with the Penn Treebank corpus (Marcus et al., 1993)... We train our model on the larger and more challenging Hutter Prize Wikipedia dataset, also known as enwik8 (Hutter, 2012)... We will train our model on the IAM online handwriting database (Liwicki & Bunke, 2005)... Finally, we experiment with the Neural Machine Translation task using the same experimental setup outlined in (Wu et al., 2016). Our model is the same wordpiece model architecture with a vocabulary size of 32k, but we replace the LSTM cells with Hyper LSTM cells. We benchmark both models on WMT'14 En→Fr...
Dataset Splits | Yes | We first evaluate the Hyper LSTM model on a character-level prediction task with the Penn Treebank corpus (Marcus et al., 1993) using the train/validation/test split outlined in (Mikolov et al., 2012).
Hardware Specification | No | The model is trained in a distributed setting with a parameter server and 12 workers. Additionally, each worker uses 8 GPUs and a minibatch of 128. The setup mentions worker and GPU counts but no specific hardware models.
Software Dependencies | No | We train the model using Adam (Kingma & Ba, 2015)... No specific software versions are mentioned.
Experiment Setup | Yes | For character-level Penn Treebank, we use mini-batches of size 128 to train on sequences of length 100. We train the model using Adam (Kingma & Ba, 2015) with a learning rate of 0.001 and gradient clipping of 1.0. As mentioned earlier, we apply dropout to the input and output layers, and also apply recurrent dropout with a keep probability of 90%. Our setup is similar to the previous experiment, using the same mini-batch size, learning rate, weight initialization, gradient clipping parameters and optimizer. We also perform training on sequences of length 250. Our normal Hyper LSTM cell consists of 256 units, and we use an embedding size of 64. For model training, we apply recurrent dropout and also dropout to the output layer with a keep probability of 0.95. The model is trained on mini-batches of size 32 containing sequences of variable length. We trained the model using Adam (Kingma & Ba, 2015) with a learning rate of 0.0001 and gradient clipping of 5.0. Our Hyper LSTM cell consists of 128 units and a signal size of 4. First, we use only Adam without SGD at the end. Adam was used with the same hyperparameters described in the GNMT paper: a learning rate of 0.0002 for 1M training steps. (An illustrative training-step sketch using the character-level Penn Treebank values is given below.)
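Since the paper provides no pseudocode, the following is a minimal NumPy sketch of the static-hypernetwork idea referenced in the Pseudocode row: a small generator network maps a learned per-layer embedding to the weight matrix of the main network. The sizes, names (W_p, W_out, generate_weights, z_layers), and the simplified two-step linear generator are illustrative assumptions, not the paper's exact construction, which factorises weight generation per output unit and, for HyperLSTM, uses a dynamic recurrent hypernetwork.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not taken from the paper).
n_in, n_out = 16, 16   # shape of each generated weight matrix
n_z = 4                # size of the per-layer embedding z_j

# Hypernetwork parameters: a simplified two-step linear map from z_j to a
# full weight matrix, standing in for the paper's factorised construction.
W_p = rng.normal(scale=0.1, size=(n_z, n_z))
b_p = np.zeros(n_z)
W_out = rng.normal(scale=0.1, size=(n_out, n_in, n_z))
b_out = np.zeros((n_out, n_in))

def generate_weights(z):
    """Map a small layer embedding z to the weight matrix of one main-network layer."""
    a = W_p @ z + b_p                                        # (n_z,)
    return np.tensordot(W_out, a, axes=([2], [0])) + b_out   # (n_out, n_in)

# Main network: two layers whose weights are generated, not stored directly.
# Only the small embeddings in z_layers and the shared generator hold parameters.
z_layers = [rng.normal(size=n_z) for _ in range(2)]
h = rng.normal(size=n_in)
for z in z_layers:
    W = generate_weights(z)   # weights produced on the fly for this layer
    h = np.tanh(W @ h)
print(h.shape)  # (16,)
```

The point of the sketch is that the main network stores only small embeddings plus a shared generator, so weights are produced rather than stored per layer, which is the relaxed weight-sharing idea the abstract describes.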
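The character-level Penn Treebank settings quoted in the Experiment Setup row (mini-batches of 128, sequences of length 100, Adam with a learning rate of 0.001, gradient clipping of 1.0, dropout keep probability 0.9) translate into a single training step roughly as follows. This is a hedged sketch: a plain PyTorch nn.LSTM stands in for the HyperLSTM cell, the paper's recurrent dropout variant is not reproduced, and the vocabulary and layer sizes are placeholder values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Settings from the quoted setup: batch 128, sequence length 100,
# Adam with lr 0.001, gradient clipping 1.0, dropout keep probability 0.9.
batch_size, seq_len = 128, 100
# Placeholder sizes (hypothetical, not from the quote).
vocab_size, embed_size, hidden_size = 50, 64, 256

class CharModel(nn.Module):
    """A plain LSTM stands in for the HyperLSTM cell, which is not reimplemented here."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.drop_in = nn.Dropout(p=0.1)    # keep probability 0.9 on the input layer
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.drop_out = nn.Dropout(p=0.1)   # keep probability 0.9 on the output layer
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.drop_in(self.embed(x)))
        return self.proj(self.drop_out(h))

model = CharModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Adam, learning rate 0.001

# One illustrative step on random token ids standing in for the real PTB character stream.
x = torch.randint(vocab_size, (batch_size, seq_len))
y = torch.randint(vocab_size, (batch_size, seq_len))
optimizer.zero_grad()
logits = model(x)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping of 1.0
optimizer.step()
```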