Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections
Authors: Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, James Bailey
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that the orthogonal constraint on the transition matrix applied through our parametrisation gives similar benefits to the unitary constraint, without the time complexity limitations. (Section 5, Experiments) |
| Researcher Affiliation | Academia | ¹The University of Melbourne, Parkville, Australia; ²Data61, CSIRO, Australia. |
| Pseudocode | Yes | Algorithm 1: Local forward and backward propagations at time step t. (A hedged sketch of one such forward step is given after this table.) |
| Open Source Code | Yes | Our implementation can be found at https://github.com/zmhammedi/Orthogonal_RNN. |
| Open Datasets | Yes | We used the MNIST image dataset. We tested the oRNN on the task of character-level prediction using the Penn Tree Bank Corpus. |
| Dataset Splits | Yes | We split the dataset into training (55000 instances), validation (5000 instances), and test sets (10000 instances). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., CPU, GPU models, memory, or cluster specifications). |
| Software Dependencies | No | All RNN models were implemented using the python library theano (Theano Development Team, 2016). We implemented the one-step FP and BP algorithms described in Algorithm 1 using C code. The paper mentions the use of Theano and C code but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | All RNN models were implemented using the python library theano (Theano Development Team, 2016). We set its activation function to the leaky_ReLU defined as φ(x) = max(x/10, x). For all experiments, we used the adam method for stochastic gradient descent (Kingma & Ba, 2014). We initialised all the parameters using uniform distributions similar to (Arjovsky et al., 2016). The biases of all models were set to zero, except for the forget bias of the LSTM, which we set to 5 to facilitate the learning of long-term dependencies (Koutník et al., 2014). All the learning rates were set to 10⁻³. We chose a batch size of 50. We experimented with (mini-batch size, learning rate) ∈ {(1, 10⁻⁴), (50, 10⁻³)}. The learning rate was set to 0.0001 for both models with a mini-batch size of 1. (A summary of this configuration is sketched after the table.) |
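
The Pseudocode row refers to Algorithm 1 of the paper, which performs the local forward and backward propagations of the Householder parametrisation at a single time step. As a rough illustration only, below is a minimal NumPy sketch of the forward part of such a step, assuming the transition matrix is expressed as a product of full-dimensional Householder reflections and using the leaky-ReLU activation quoted in the Experiment Setup row. It is not the authors' Theano/C implementation, and the function names (`leaky_relu`, `householder_product`, `ornn_step`) are hypothetical.

```python
import numpy as np

def leaky_relu(x):
    # Activation quoted in the paper: phi(x) = max(x/10, x).
    return np.maximum(x / 10.0, x)

def householder_product(us, h):
    # Apply a product of Householder reflections H(u) = I - 2 u u^T / ||u||^2
    # to the vector h, one reflection at a time. This keeps the cost linear in
    # the number of reflections instead of forming the n x n matrix explicitly.
    for u in reversed(us):  # application order is an assumption of this sketch
        h = h - 2.0 * u * (u @ h) / (u @ u)
    return h

def ornn_step(us, V, b, h_prev, x_t):
    # One forward step h_t = phi(W h_{t-1} + V x_t + b), with W represented
    # implicitly by the list of reflection vectors `us`.
    return leaky_relu(householder_product(us, h_prev) + V @ x_t + b)

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
n, n_in, m = 8, 3, 4
us = [rng.standard_normal(n) for _ in range(m)]
V = rng.standard_normal((n, n_in))
h = ornn_step(us, V, np.zeros(n), np.zeros(n), rng.standard_normal(n_in))
```

Because each Householder reflection is orthogonal, the implied transition matrix is orthogonal by construction, which is the property the parametrisation relies on. In the paper the reflections may also act only on trailing sub-vectors of the hidden state; the full-dimensional case above is a simplification.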
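
For the Experiment Setup row, the hyperparameters quoted from the paper can be collected into a single configuration summary. This is a hedged sketch only; the authors' implementation uses Theano with a C kernel for the one-step forward/backward propagation, and the key names below are hypothetical.

```python
# Hypothetical key names; values are taken from the quoted experiment setup.
training_config = {
    "optimizer": "adam",          # Kingma & Ba (2014)
    "learning_rate": 1e-3,        # reduced to 1e-4 when the mini-batch size is 1
    "batch_size": 50,
    "activation": "leaky_ReLU",   # phi(x) = max(x/10, x)
    "weight_init": "uniform",     # similar to Arjovsky et al. (2016)
    "bias_init": 0.0,             # all biases zero ...
    "lstm_forget_bias": 5.0,      # ... except the LSTM forget bias, set to 5
}
```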