On Multiplicative Integration with Recurrent Neural Networks

Authors: Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Russ R. Salakhutdinov

NeurIPS 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models.
Researcher Affiliation | Academia | Yuhuai Wu¹, Saizheng Zhang², Ying Zhang², Yoshua Bengio²,⁴ and Ruslan Salakhutdinov³,⁴; ¹University of Toronto, ²MILA, Université de Montréal, ³Carnegie Mellon University, ⁴CIFAR; ywu@cs.toronto.edu, ²{firstname.lastname}@umontreal.ca, rsalakhu@cs.cmu.edu
Pseudocode | No | The paper describes mathematical formulations and derivations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'We exactly follow the authors' Theano implementation of the skip-thought model: https://github.com/ryankiros/skip-thoughts', which refers to a third-party implementation, not open-source code for their proposed Multiplicative Integration.
Open Datasets | Yes | We conduct evaluations on several tasks using different RNN models... character-level language modeling, speech recognition, large-scale sentence representation learning using a Skip-Thought model, and teaching a machine to read and comprehend for a question-answering task. Specific datasets mentioned include the Penn-Treebank dataset [11], text8 (http://mattmahoney.net/dc/textdata), Hutter Challenge Wikipedia (http://prize.hutter1.net/), the Wall Street Journal (WSJ) corpus (LDC93S6B and LDC94S13B), the SICK dataset, the Microsoft Research Paraphrase Corpus, and the CNN corpus.
Dataset Splits | Yes | For the Wall Street Journal (WSJ) corpus, we use the full 81-hour set si284 for training, set dev93 for validation and set eval92 for test. For character-level language modeling on Penn-Treebank, we follow the data partition in [12]. For the text8 and Hutter Challenge Wikipedia datasets, we follow the training protocols in [12] and [1], respectively.
Hardware Specification | No | The paper states 'a single pass through the training data can take up to a week on a high-end GPU (as reported in [23])' in the context of the Skip-Thought model, but it does not specify the exact GPU model, CPU, or any other hardware used for the experiments.
Software Dependencies | No | The paper mentions using the 'Adam optimization algorithm [13]' and acknowledges 'the developers of Theano [29] and Keras [30]', but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | In all of our experiments, we use the general form of Multiplicative Integration (Eq. 4)... All models have a single hidden layer of size 2048, and we use the Adam optimization algorithm [13] with learning rate 1e-4. Weights are initialized to samples drawn from uniform[-0.02, 0.02]. For the text8 and Hutter Challenge Wikipedia datasets, we use Adam for optimization with the starting learning rate grid-searched in {0.002, 0.001, 0.0005}. If the validation BPC does not decrease for 2 epochs, we halve the learning rate. For speech recognition, Adam with learning rate 0.0001 is used for optimization, and Gaussian weight noise with zero mean and 0.05 standard deviation is injected for regularization. For Skip-Thought, the encoder and decoder are single-layer GRUs with a hidden-layer size of 2400; all recurrent matrices adopt orthogonal initialization. For the Attentive Reader, a single hidden layer of size 240 is used, and we follow the experimental protocol of [7] with exactly the same settings, except that gradient clipping is removed for MI-LSTMs. For all MI models, {α, β1, β2, b} were initialized to {1, 1, 1, 0} unless otherwise specified (e.g., {2, 0.5, 0.5, 0} or {1, 0.5, 0.5, 0}).
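To make the Experiment Setup row concrete, below is a minimal NumPy sketch of the general Multiplicative Integration block (Eq. 4 of the paper) with the reported initialization, plus one plausible reading of the learning-rate halving rule. This is an illustration written for this summary, not the authors' implementation: the function names, the input dimension, the random seed, and the `patience` interpretation of 'does not decrease for 2 epochs' are assumptions.

import numpy as np

# Sketch (not the authors' code) of the general Multiplicative Integration (MI)
# building block referenced as Eq. 4:
#   phi(alpha ⊙ (W x) ⊙ (U h) + beta1 ⊙ (U h) + beta2 ⊙ (W x) + b),
# where ⊙ is the element-wise (Hadamard) product and phi is a nonlinearity.

def init_mi_params(input_dim, hidden_dim, alpha=1.0, beta1=1.0, beta2=1.0, seed=0):
    """Initialize MI parameters as reported: weight matrices drawn from
    uniform[-0.02, 0.02]; gating parameters {alpha, beta1, beta2, b} default to
    {1, 1, 1, 0}; variants such as {2, 0.5, 0.5, 0} can be passed in where the
    paper specifies them."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.uniform(-0.02, 0.02, size=(hidden_dim, input_dim)),
        "U": rng.uniform(-0.02, 0.02, size=(hidden_dim, hidden_dim)),
        "alpha": np.full(hidden_dim, alpha),
        "beta1": np.full(hidden_dim, beta1),
        "beta2": np.full(hidden_dim, beta2),
        "b": np.zeros(hidden_dim),
    }

def mi_block(params, x, h_prev, phi=np.tanh):
    """General MI block: phi(alpha*Wx*Uh + beta1*Uh + beta2*Wx + b)."""
    Wx = params["W"] @ x
    Uh = params["U"] @ h_prev
    pre = (params["alpha"] * Wx * Uh
           + params["beta1"] * Uh
           + params["beta2"] * Wx
           + params["b"])
    return phi(pre)

def maybe_halve_lr(lr, val_bpc_history, patience=2):
    """Halve the learning rate when validation BPC has not decreased for
    `patience` epochs (one reading of the reported text8/Wikipedia schedule)."""
    if len(val_bpc_history) > patience and (
        min(val_bpc_history[-patience:]) >= min(val_bpc_history[:-patience])
    ):
        return lr / 2.0
    return lr

# Illustrative usage with the Penn-Treebank-style size quoted above
# (hidden layer of 2048 units); the input dimension here is arbitrary.
params = init_mi_params(input_dim=50, hidden_dim=2048)
h = mi_block(params, x=np.zeros(50), h_prev=np.zeros(2048))

Setting alpha to zero and both betas to one recovers the ordinary additive block phi(Wx + Uh + b), which is why the gating-parameter initialization quoted in the table matters: it controls how strongly the multiplicative term contributes at the start of training.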