On orthogonality and learning recurrent networks with long term dependencies
Authors: Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, Chris Pal
ICML 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper explores issues with optimization convergence, speed and gradient stability when encouraging and enforcing orthogonality. To perform this analysis, we propose a weight matrix factorization and parameterization strategy through which we can bound matrix norms and therein control the degree of expansivity induced during backpropagation. We find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance. (...) In this section, we explore hard and soft orthogonality constraints on factorized weight matrices for recurrent neural network hidden to hidden transitions. (...) We begin our analyses on tasks that are designed to stress memory: a sequence copying task and a basic addition task (Hochreiter & Schmidhuber, 1997). We then move on to tasks on real data that require models to capture long-range dependencies: digit classification based on sequential and permuted MNIST vectors (Le et al., 2015; LeCun et al., 1998). Finally, we look at a basic language modeling task using the Penn Treebank dataset (Marcus et al., 1993). (See the factorization sketch below the table.) |
| Researcher Affiliation | Academia | Eugene Vorontsov 1,2, Chiheb Trabelsi 1,2, Samuel Kadoury 1,3, Chris Pal 1,2. 1 École Polytechnique de Montréal, Montréal, Canada; 2 Montreal Institute for Learning Algorithms, Montréal, Canada; 3 CHUM Research Center, Montréal, Canada. Correspondence to: Eugene Vorontsov <eugene.vorontsov@gmail.com>. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'The neural network code was built on the Theano framework (Theano Development Team, 2016)' but does not provide a specific link or statement about making their own implementation code open source. |
| Open Datasets | Yes | We begin our analyses on tasks that are designed to stress memory: a sequence copying task and a basic addition task (Hochreiter & Schmidhuber, 1997). We then move on to tasks on real data that require models to capture long-range dependencies: digit classification based on sequential and permuted MNIST vectors (Le et al., 2015; LeCun et al., 1998). Finally, we look at a basic language modeling task using the Penn Treebank dataset (Marcus et al., 1993). |
| Dataset Splits | Yes | For MNIST and PTB, hyperparameter selection and early stopping were performed targeting the best validation set accuracy, with results reported on the test set. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper states 'The neural network code was built on the Theano framework (Theano Development Team, 2016)' but does not provide a specific version number for Theano or any other software dependency. |
| Experiment Setup | Yes | In all experiments, we employed RMSprop (Tieleman & Hinton, 2012) when not using geodesic gradient descent. We used minibatches of size 50 and for generated data (the copy and adding tasks), we assumed an epoch length of 100 minibatches. We cautiously introduced gradient clipping at magnitude 100 (unless stated otherwise) in all of our RNN experiments although it may not be required and we consistently applied a small weight decay of 0.0001. Unless otherwise specified, we trained all simple recurrent neural networks with the hidden to hidden matrix factorization as in (8) using geodesic gradient descent on the bases (learning rate 10^-6) and RMSprop on the other parameters (learning rate 0.0001), using a tanh transition nonlinearity, and clipping gradients of 100 magnitude. |
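The Research Type row quotes the paper's central idea: factorize the hidden-to-hidden weight matrix and parameterize its singular values so that matrix norms, and hence gradient expansivity during backpropagation, stay bounded. As a minimal illustration only, the numpy sketch below assumes a factorization of the form W = U S V^T with orthogonal bases U and V and singular values pushed into [1 - m, 1 + m] by a scaled sigmoid, where m is a spectral margin; the function names, the sigmoid parameterization, and the QR-based initialization are our assumptions, not code from the paper, which additionally keeps U and V orthogonal via geodesic updates during training.

```python
import numpy as np

def bounded_singular_values(p, margin):
    """Map free parameters p to singular values in [1 - margin, 1 + margin]
    using a scaled, shifted sigmoid (illustrative parameterization)."""
    return 2.0 * margin * (1.0 / (1.0 + np.exp(-p)) - 0.5) + 1.0

def factorized_transition(U, p, V, margin):
    """Compose the hidden-to-hidden matrix W = U @ diag(s) @ V.T, where U and V
    are orthogonal bases and the singular values s stay within the spectral
    margin around 1, bounding how much W can expand or contract gradients."""
    s = bounded_singular_values(p, margin)
    return U @ np.diag(s) @ V.T

# Example: a 4x4 transition with spectral margin 0.1, so every singular value
# of W lies in [0.9, 1.1].
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal basis
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal basis
p = rng.standard_normal(4)                        # free singular-value parameters
W = factorized_transition(U, p, V, margin=0.1)
print(np.linalg.svd(W, compute_uv=False))         # all values within [0.9, 1.1]
```

With margin m = 0, this collapses to a hard orthogonality constraint (all singular values equal to 1); larger margins relax the constraint, which is the trade-off the paper studies.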
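Purely as a reading aid, the sketch below gathers the hyperparameters quoted in the Experiment Setup row into one Python configuration. The dictionary keys and the norm-based reading of "gradient clipping at magnitude 100" are our assumptions, since the authors' training code is not released.

```python
import numpy as np

# Hyperparameters quoted in the "Experiment Setup" row, gathered into a single
# configuration; the key names are illustrative, not taken from the authors' code.
TRAIN_CONFIG = {
    "optimizer": "RMSprop",          # used for all non-basis parameters
    "learning_rate": 1e-4,           # RMSprop step size
    "geodesic_learning_rate": 1e-6,  # geodesic gradient descent on the orthogonal bases
    "batch_size": 50,
    "epoch_length": 100,             # minibatches per "epoch" on the copy/adding tasks
    "grad_clip_magnitude": 100.0,
    "weight_decay": 1e-4,
    "transition_nonlinearity": "tanh",
}

def clip_gradient(grad, magnitude=TRAIN_CONFIG["grad_clip_magnitude"]):
    """Rescale `grad` so its L2 norm does not exceed `magnitude`. This assumes
    'clipping at magnitude 100' means norm clipping; elementwise clipping to
    [-100, 100] is an equally plausible reading of the paper's wording."""
    norm = np.linalg.norm(grad)
    return grad * (magnitude / norm) if norm > magnitude else grad
```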