On Multiplicative Integration with Recurrent Neural Networks
Authors: Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Russ R. Salakhutdinov
NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models. |
| Researcher Affiliation | Academia | Yuhuai Wu¹, Saizheng Zhang², Ying Zhang², Yoshua Bengio²,⁴ and Ruslan Salakhutdinov³,⁴; ¹University of Toronto, ²MILA, Université de Montréal, ³Carnegie Mellon University, ⁴CIFAR; ywu@cs.toronto.edu, {firstname.lastname}@umontreal.ca, rsalakhu@cs.cmu.edu |
| Pseudocode | No | The paper describes mathematical formulations and derivations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We exactly follow the authors' Theano implementation of the skip-thought model: https://github.com/ryankiros/skip-thoughts', which points to a third-party implementation, not an open-source release of the authors' own Multiplicative Integration code. |
| Open Datasets | Yes | We conduct evaluations on several tasks using different RNN models... character level language modeling, speech recognition, large scale sentence representation learning using a Skip-Thought model, and teaching a machine to read and comprehend for a question answering task. Specific datasets mentioned include the Penn-Treebank dataset [11], text8 (http://mattmahoney.net/dc/textdata), Hutter Challenge Wikipedia (http://prize.hutter1.net/), the Wall Street Journal (WSJ) corpus (LDC93S6B and LDC94S13B), the SICK dataset, the Microsoft Research Paraphrase Corpus, and the CNN corpus. |
| Dataset Splits | Yes | For the Wall Street Journal (WSJ) corpus, we use the full 81 hour set si284 for training, set dev93 for validation and set eval92 for test. For character level language modeling on Penn-Treebank, we follow the data partition in [12]. For text8 and Hutter Challenge Wikipedia datasets, we follow the training protocols in [12] and [1] respectively. |
| Hardware Specification | No | The paper states 'a single pass through the training data can take up to a week on a high-end GPU (as reported in [23])' in the context of the Skip-Thought model, but it does not specify the exact model or type of GPU, CPU, or any other hardware component used for their experiments. |
| Software Dependencies | No | The paper mentions using 'Adam optimization algorithm [13]', and acknowledges 'the developers of Theano [29] and Keras [30]', but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In all of our experiments, we use the general form of Multiplicative Integration (Eq. 4)... All models have a single hidden layer of size 2048, and we use the Adam optimization algorithm [13] with learning rate 1e-4. Weights are initialized to samples drawn from uniform[-0.02, 0.02]. For the text8 and Hutter Challenge Wikipedia datasets, we use Adam for optimization with the starting learning rate grid-searched in {0.002, 0.001, 0.0005}. If the validation BPC does not decrease for 2 epochs, we halve the learning rate. For speech recognition, Adam with learning rate 0.0001 is used for optimization and Gaussian weight noise with zero mean and 0.05 standard deviation is injected for regularization. For Skip-Thought, the encoder and decoder are single-layer GRUs with a hidden-layer size of 2400; all recurrent matrices adopt orthogonal initialization. For Attentive Reader, a single hidden layer of size 240 is used, and we follow the experimental protocol of [7] with exactly the same settings as theirs, except we remove the gradient clipping for MI-LSTMs. For all MI models, {α, β1, β2, b} were initialized to {1, 1, 1, 0} unless otherwise specified (e.g., {2, 0.5, 0.5, 0} or {1, 0.5, 0.5, 0}). (A sketch of the general MI form appears below the table.) |
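The setup above refers to "the general form of Multiplicative Integration (Eq. 4)" without reproducing it. The snippet below is a minimal, illustrative sketch of that building block applied to a vanilla tanh RNN step, h_t = φ(α ⊙ Wx_t ⊙ Uh_{t−1} + β1 ⊙ Uh_{t−1} + β2 ⊙ Wx_t + b); it is not the authors' implementation, and the function and variable names (`mi_rnn_step`, `W`, `U`, `phi`) are assumptions made for the example.

```python
# Hedged sketch of the general Multiplicative Integration (MI) form, Eq. 4 of the paper,
# applied to a plain tanh RNN step. Not the authors' code; names are illustrative.
import numpy as np

def mi_rnn_step(x_t, h_prev, W, U, alpha, beta1, beta2, b, phi=np.tanh):
    """One recurrent step with the general MI form:
    phi(alpha * (W x_t) * (U h_prev) + beta1 * (U h_prev) + beta2 * (W x_t) + b)."""
    wx = W @ x_t        # input projection, shape (hidden,)
    uh = U @ h_prev     # recurrent projection, shape (hidden,)
    # second-order (multiplicative) term plus the two first-order terms and the bias
    return phi(alpha * wx * uh + beta1 * uh + beta2 * wx + b)

# Toy usage with the paper's default MI initialization {alpha, beta1, beta2, b} = {1, 1, 1, 0}
hidden, inp = 8, 5
rng = np.random.default_rng(0)
W = rng.uniform(-0.02, 0.02, size=(hidden, inp))    # uniform[-0.02, 0.02] init, as in the setup above
U = rng.uniform(-0.02, 0.02, size=(hidden, hidden))
alpha, beta1, beta2 = np.ones(hidden), np.ones(hidden), np.ones(hidden)
b = np.zeros(hidden)
h = np.zeros(hidden)
for x_t in rng.normal(size=(3, inp)):               # a short dummy input sequence
    h = mi_rnn_step(x_t, h, W, U, alpha, beta1, beta2, b)
print(h.shape)  # (8,)
```

Note that setting α to zero recovers the usual additive Wx + Uh structure, so the default {1, 1, 1, 0} initialization keeps both the multiplicative and additive terms active from the start of training.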