On orthogonality and learning recurrent networks with long term dependencies

Authors: Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, Chris Pal

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper explores issues with optimization convergence, speed and gradient stability when encouraging and enforcing orthogonality. To perform this analysis, we propose a weight matrix factorization and parameterization strategy through which we can bound matrix norms and therein control the degree of expansivity induced during backpropagation. We find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance. (...) In this section, we explore hard and soft orthogonality constraints on factorized weight matrices for recurrent neural network hidden to hidden transitions. (...) We begin our analyses on tasks that are designed to stress memory: a sequence copying task and a basic addition task (Hochreiter & Schmidhuber, 1997). We then move on to tasks on real data that require models to capture long-range dependencies: digit classification based on sequential and permuted MNIST vectors (Le et al., 2015; Le Cun et al., 1998). Finally, we look at a basic language modeling task using the Penn Treebank dataset (Marcus et al., 1993). (A sketch of the factorized parameterization appears below the table.)
Researcher Affiliation | Academia | Eugene Vorontsov 1,2; Chiheb Trabelsi 1,2; Samuel Kadoury 1,3; Chris Pal 1,2. 1 École Polytechnique de Montréal, Montréal, Canada; 2 Montreal Institute for Learning Algorithms, Montréal, Canada; 3 CHUM Research Center, Montréal, Canada. Correspondence to: Eugene Vorontsov <eugene.vorontsov@gmail.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'The neural network code was built on the Theano framework (Theano Development Team, 2016)' but does not provide a specific link or statement about making their own implementation code open source.
Open Datasets | Yes | We begin our analyses on tasks that are designed to stress memory: a sequence copying task and a basic addition task (Hochreiter & Schmidhuber, 1997). We then move on to tasks on real data that require models to capture long-range dependencies: digit classification based on sequential and permuted MNIST vectors (Le et al., 2015; Le Cun et al., 1998). Finally, we look at a basic language modeling task using the Penn Treebank dataset (Marcus et al., 1993).
Dataset Splits | Yes | For MNIST and PTB, hyperparameter selection and early stopping were performed targeting the best validation set accuracy, with results reported on the test set.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | No | The paper states 'The neural network code was built on the Theano framework (Theano Development Team, 2016)' but does not provide a specific version number for Theano or any other software dependency.
Experiment Setup | Yes | In all experiments, we employed RMSprop (Tieleman & Hinton, 2012) when not using geodesic gradient descent. We used minibatches of size 50 and for generated data (the copy and adding tasks), we assumed an epoch length of 100 minibatches. We cautiously introduced gradient clipping at magnitude 100 (unless stated otherwise) in all of our RNN experiments although it may not be required and we consistently applied a small weight decay of 0.0001. Unless otherwise specified, we trained all simple recurrent neural networks with the hidden to hidden matrix factorization as in (8) using geodesic gradient descent on the bases (learning rate 10^-6) and RMSprop on the other parameters (learning rate 0.0001), using a tanh transition nonlinearity, and clipping gradients of 100 magnitude. (A sketch of this training setup appears below the table.)
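
The following is a minimal sketch, not the authors' code, of the kind of factorized parameterization described in the Research Type row: the hidden-to-hidden matrix is composed as W = U diag(s) V^T with orthogonal bases U and V, and the singular values s are squashed into a band around 1 (here via a sigmoid with an assumed "spectral margin" m) so that the norm of W, and hence the expansivity induced during backpropagation, stays bounded. All function names are illustrative.

import numpy as np

def bounded_singular_values(p, margin):
    """Map free parameters p to singular values in [1 - margin, 1 + margin]."""
    return 2.0 * margin * (1.0 / (1.0 + np.exp(-p)) - 0.5) + 1.0

def compose_recurrent_matrix(U, V, p, margin):
    """Assemble W = U diag(s) V^T from orthogonal bases and a bounded spectrum."""
    s = bounded_singular_values(p, margin)
    return U @ np.diag(s) @ V.T

# Toy usage: random orthogonal bases via QR; margin 0.1 keeps the spectral
# norm of W within [0.9, 1.1]. With p = 0 the spectrum is exactly 1 and W
# is orthogonal.
n = 128
U, _ = np.linalg.qr(np.random.randn(n, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
p = np.zeros(n)  # free spectral parameters, one per singular value
W = compose_recurrent_matrix(U, V, p, margin=0.1)
print(np.linalg.norm(W, 2))  # stays close to 1

A hard orthogonality constraint corresponds to margin = 0 (all singular values pinned to 1); soft constraints allow the spectrum to drift within the margin.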
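
Below is a minimal sketch of the optimization settings quoted in the Experiment Setup row: RMSprop with learning rate 1e-4, weight decay 1e-4, minibatches of 50, and gradient clipping at magnitude 100 (interpreted here as norm clipping; the paper does not specify norm versus elementwise). The geodesic gradient descent step on the orthogonal bases (learning rate 1e-6) is not reproduced. The function names, the RMSprop decay rho = 0.9, and eps = 1e-8 are assumptions, not taken from the authors' Theano code.

import numpy as np

# Hyperparameter values are the paper's; the code structure is an assumption.
BATCH_SIZE = 50
CLIP_MAGNITUDE = 100.0
WEIGHT_DECAY = 1e-4
RMSPROP_LR = 1e-4
GEODESIC_LR = 1e-6  # used for the orthogonal bases, not implemented here

def clip_gradient(grad, max_norm=CLIP_MAGNITUDE):
    """Rescale the gradient if its norm exceeds the clipping threshold."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def rmsprop_step(param, grad, cache, lr=RMSPROP_LR, rho=0.9, eps=1e-8):
    """One RMSprop update with L2 weight decay folded into the gradient."""
    grad = clip_gradient(grad) + WEIGHT_DECAY * param
    cache = rho * cache + (1.0 - rho) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy usage on a single parameter vector.
w = np.random.randn(10)
g = np.random.randn(10)
cache = np.zeros_like(w)
w, cache = rmsprop_step(w, g, cache)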