Shuffling Recurrent Neural Networks
Authors: Michael Rotman, Lior Wolf
AAAI 2021, pp. 9428-9435 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an extensive set of experiments, the method shows competitive results, in comparison to the leading literature baselines. We share our implementation at https://github.com/rotmanmi/SRNN. Experiments: We compare our SRNN architecture to the leading RNN architectures from the current literature. The baseline methods include: (1) a vanilla RNN, (2) LSTM (Hochreiter and Schmidhuber 1997), (3) GRU (Cho et al. 2014), (4) uRNN (Arjovsky, Shah, and Bengio 2016), (5) NRU (Chandar et al. 2019), and (6) nnRNN (Kerg et al. 2019). |
| Researcher Affiliation | Academia | Michael Rotman and Lior Wolf The School of Computer Science, Tel Aviv University rotmanmi@post.tau.ac.il, wolf@cs.tau.ac.il |
| Pseudocode | No | The SRNN layer contains two hidden-state processing components: the learned network β, which is comprised of fully connected layers, and a fixed permutation matrix W_p. At each time step, the layer, like other RNNs, receives two input signals: the hidden state of the previous time step, h_{t−1} ∈ ℝ^{d_h}, and the input at the current time step, x_t ∈ ℝ^{d_i}, where d_h and d_i are the dimensions of the hidden state and the input, respectively. The layer computes the following (we redefine the notation, disregarding the definitions of Sec. ): h_t = σ(W_p h_{t−1} + β(x_t)) =: σ(z_t), (3) where σ is the activation function (such as ReLU or tanh), see Fig. 1(a). (A code sketch of this update appears after the table.) |
| Open Source Code | Yes | We share our implementation at https://github.com/rotmanmi/SRNN. |
| Open Datasets | Yes | The permuted MNIST (pMNIST) benchmark by Le, Jaitly, and Hinton (2015) measures the performance of RNNs when modeling complex long-term dependencies. The TIMIT (Garofolo et al. 1993) speech frames prediction task was introduced by Wisdom et al. (2016) and later (following a fix to the way the error is computed) used in Lezcano-Casado and Martínez-Rubio (2019). The Nottingham Polyphonic Music dataset (Boulanger-Lewandowski, Bengio, and Vincent 2012) is a collection of British and American folk tunes. |
| Dataset Splits | Yes | The train/validation/test splits and exactly the same data used in previous work are employed here: 3640 utterances for training, 192 for validation, and 400 for testing. The training set consists of 694 tunes; 173 and 170 tunes are used as the validation and test sets, respectively. |
| Hardware Specification | No | Note that uRNN could not run due to GPU memory limitations, despite an effort to optimize the code using PyTorch JIT (Paszke et al. 2019). |
| Software Dependencies | No | All methods, except NRU, employed the RMSProp (Bengio 2015) optimizer with a learning rate of 0.001 and a decay rate of 0.9. For NRU, we have used the suggested ADAM (Kingma and Ba 2014) optimizer with a learning rate of 0.001, and employed gradient clipping with a norm of one. Note that uRNN could not run due to GPU memory limitations, despite an effort to optimize the code using PyTorch JIT (Paszke et al. 2019). |
| Experiment Setup | Yes | All methods, except NRU, employed the RMSProp (Bengio 2015) optimizer with a learning rate of 0.001 and a decay rate of 0.9. For NRU, we have used the suggested ADAM (Kingma and Ba 2014) optimizer with a learning rate of 0.001, and employed gradient clipping with a norm of one. We trained all models with a minibatch of size 20. We used a hidden size of d_h = 128 for all models. Network β contains one hidden layer with f_r = 8, i.e., it projects the input to activations in ℝ^8 and then to ℝ^128. A hidden size of 128 was used for all methods. All models were fed with a minibatch of size 50. For SRNN, a hidden state size of d_h = 1024 was used, and the function f_r of network β contained three hidden layers of size 32. A minibatch size of 100 was used for training, similar to the experiments performed for NRU. Models were trained for 60 epochs. For modeling with SRNN, we use a stack of 3 SRNN layers with one hidden layer inside network β with f_r = 128, and a hidden state size of d_h = 2048. The optimizer used was Adam with a learning rate of 0.001, and between each intermediate SRNN layer we also apply dropout with p = 0.3 to avoid overfitting. (A configuration sketch appears after the table.) |
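For concreteness, the per-step update quoted in the Pseudocode row can be written as a short PyTorch module. This is a minimal sketch, assuming ReLU activations inside β and a random fixed permutation standing in for W_p; the class name `SRNNCell` and the exact wiring are illustrative and not taken from the authors' repository.

```python
import torch
import torch.nn as nn


class SRNNCell(nn.Module):
    """Sketch of one SRNN step: h_t = sigma(W_p h_{t-1} + beta(x_t))."""

    def __init__(self, d_i: int, d_h: int, f_r: int = 8):
        super().__init__()
        # Fixed, non-learned permutation of the hidden state (the role of W_p).
        self.register_buffer("perm", torch.randperm(d_h))
        # Learned network beta with one hidden layer: input -> R^{f_r} -> R^{d_h}.
        self.beta = nn.Sequential(
            nn.Linear(d_i, f_r),
            nn.ReLU(),
            nn.Linear(f_r, d_h),
        )
        self.activation = nn.ReLU()  # the paper allows ReLU or tanh

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # z_t = W_p h_{t-1} + beta(x_t);  h_t = sigma(z_t)   -- Eq. (3)
        z_t = h_prev[:, self.perm] + self.beta(x_t)
        return self.activation(z_t)


# Usage with the pMNIST-style settings quoted above (d_h = 128, f_r = 8),
# feeding one pixel per time step with a minibatch of 20:
cell = SRNNCell(d_i=1, d_h=128, f_r=8)
h = torch.zeros(20, 128)
for x_t in torch.randn(784, 20, 1):
    h = cell(x_t, h)
```

Applying the fixed permutation by indexing (`h_prev[:, self.perm]`) is equivalent to multiplying by a permutation matrix W_p, but avoids materializing a d_h × d_h matrix.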
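The Experiment Setup row can likewise be summarized as a configuration sketch. It reuses the hypothetical `SRNNCell` above and follows the reported hyperparameters for the stacked three-layer variant (d_h = 2048, one hidden layer of size f_r = 128 inside β, dropout p = 0.3, Adam with learning rate 0.001); the input width of 88 and the way the layers are wired together are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Assumed input width (e.g., an 88-dimensional piano roll); not stated in the row.
d_i, d_h = 88, 2048

# Stack of 3 SRNN layers with dropout between intermediate layers, as reported.
layers = nn.ModuleList(
    [SRNNCell(d_i if i == 0 else d_h, d_h, f_r=128) for i in range(3)]
)
dropout = nn.Dropout(p=0.3)

params = [p for layer in layers for p in layer.parameters()]

# Optimizer for this stacked setup, as reported: Adam with lr = 0.001.
optimizer = torch.optim.Adam(params, lr=1e-3)

# The other tasks in the row use RMSprop (lr = 0.001, decay 0.9), or Adam with
# gradient clipping at norm 1 for NRU:
# optimizer = torch.optim.RMSprop(params, lr=1e-3, alpha=0.9)
# torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)


# One time step through the stack, applying dropout between intermediate layers
# (the exact placement of dropout is an assumption).
def step(x_t, hidden):
    out = x_t
    for i, layer in enumerate(layers):
        hidden[i] = layer(out, hidden[i])
        out = dropout(hidden[i]) if i < len(layers) - 1 else hidden[i]
    return out, hidden


hidden = [torch.zeros(100, d_h) for _ in layers]  # minibatch of 100, as reported
out, hidden = step(torch.randn(100, d_i), hidden)
```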