Improving the Gating Mechanism of Recurrent Neural Networks

Authors: Albert Gu, Caglar Gulcehre, Thomas Paine, Matt Hoffman, Razvan Pascanu

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our simple gating mechanisms robustly improve the performance of recurrent models on a range of applications, including synthetic memorization tasks, sequential image classification, language modeling, and reinforcement learning, particularly when long-term dependencies are involved.
Researcher Affiliation | Collaboration | 1 Stanford University, USA; 2 DeepMind, London, UK. Correspondence to: Albert Gu <albertgu@stanford.edu>, Caglar Gulcehre <caglarg@google.com>.
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not provide any explicit statement or link indicating the availability of open-source code for the described methodology.
Open Datasets | Yes | Empirically, our simple gating mechanisms robustly improve the performance of recurrent models on a range of applications, including synthetic memorization tasks, sequential image classification, language modeling, and reinforcement learning, particularly when long-term dependencies are involved. We test on the sequential MNIST (sMNIST), permuted MNIST (pMNIST) (Le et al., 2015), and sequential CIFAR-10 (sCIFAR-10) tasks. We consider word-level language modeling on the WikiText-103 dataset, where (i) the dependency lengths are much shorter than in the synthetic tasks, (ii) language has an implicit hierarchical structure and timescales of varying lengths. We chose the Passive match and Active match tasks from Hung et al. (2018).
Dataset Splits | Yes | We use the standard training and test splits for all datasets. We use the standard splits provided with the dataset, and report perplexity on the validation and test sets.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, or detailed cloud instances) used to run its experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to reproduce the experiments.
Experiment Setup | Yes | For these tasks, we used single layer models with 256 hidden units, trained using Adam with learning rate 10^-3. All models are trained with Adam (Kingma & Ba, 2014) with learning rate 0.001 and batch size 64 for 100 epochs. When chrono initialization is used and not explicitly tuned, we set Tmax to be proportional to the hidden size. Following (Arjovsky et al., 2016; Tallec & Ollivier, 2018), for the copy task, we use sequence lengths N=500 for training and N=1000 for evaluation. For the adding task, we use N=200 for training and N=1000 for evaluation.
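
The experiment setup quoted above mentions chrono initialization with Tmax proportional to the hidden size but does not include a reference implementation. The snippet below is a minimal NumPy sketch of chrono initialization as described by Tallec & Ollivier (2018): the forget-gate bias is drawn as log(Uniform(1, Tmax-1)) and the input-gate bias is its negative. The hidden size of 256 comes from the quoted setup; setting Tmax equal to the hidden size is an illustrative assumption, since the paper's proportionality constant is not given here.

```python
# Minimal sketch of chrono initialization (Tallec & Ollivier, 2018) for
# an LSTM's gate biases. The proportionality constant for Tmax below is
# an assumption made for illustration, not a value quoted from the paper.
import numpy as np

def chrono_init(hidden_size, t_max, rng=None):
    """Return chrono-initialized forget- and input-gate biases.

    b_f ~ log(Uniform(1, t_max - 1)),  b_i = -b_f,
    so the initial forget gates correspond to memory timescales spread
    between roughly 1 and t_max steps.
    """
    rng = np.random.default_rng() if rng is None else rng
    b_f = np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))
    b_i = -b_f
    return b_f, b_i

# Hidden size 256 as in the quoted setup; Tmax = hidden size is illustrative.
hidden = 256
b_f, b_i = chrono_init(hidden, t_max=hidden)
print(b_f.shape, float(b_f.mean()))
```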
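The setup also cites the copy and adding tasks following Arjovsky et al. (2016). As a concrete illustration of the adding task, the sketch below generates examples of a given length N (N=200 for training and N=1000 for evaluation, per the quoted setup). The two-channel input format and the uniform [0, 1] value range follow the commonly used formulation of this benchmark rather than text quoted from the paper; here the two marker positions are drawn uniformly without replacement, whereas some variants constrain them to the first and second halves of the sequence.

```python
# Hypothetical data generator for the adding task. Each example is a
# length-N sequence with two channels: random values in [0, 1] and a 0/1
# marker channel with exactly two 1s; the regression target is the sum of
# the two marked values.
import numpy as np

def make_adding_batch(batch_size, seq_len, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    values = rng.uniform(0.0, 1.0, size=(batch_size, seq_len))
    markers = np.zeros((batch_size, seq_len))
    targets = np.zeros(batch_size)
    for b in range(batch_size):
        i, j = rng.choice(seq_len, size=2, replace=False)
        markers[b, i] = markers[b, j] = 1.0
        targets[b] = values[b, i] + values[b, j]
    # Inputs have shape (batch, seq_len, 2); targets have shape (batch,).
    inputs = np.stack([values, markers], axis=-1)
    return inputs, targets

# Train/eval lengths from the quoted setup: N=200 for training, N=1000 for evaluation.
x_train, y_train = make_adding_batch(batch_size=64, seq_len=200)
x_eval, y_eval = make_adding_batch(batch_size=64, seq_len=1000)
print(x_train.shape, y_train.shape)  # (64, 200, 2) (64,)
```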