How to Construct Deep Recurrent Neural Networks
Authors: Razvan Pascanu; Caglar Gulcehre; Kyunghyun Cho; Yoshua Bengio
ICLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. |
| Researcher Affiliation | Academia | 1 Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, {pascanur, gulcehrc}@iro.umontreal.ca, yoshua.bengio@umontreal.ca 2 Department of Information and Computer Science, Aalto University School of Science, kyunghyun.cho@aalto.fi |
| Pseudocode | No | The paper provides mathematical equations and architectural diagrams but no pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link for open-source code. |
| Open Datasets | Yes | We test the RNNs on the task of polyphonic music prediction using three datasets which are Nottingham, JSB Chorales and Muse Data (Boulanger-Lewandowski et al., 2012). On the task of character-level and word-level language modeling, we use Penn Treebank Corpus (Marcus et al., 1993). |
| Dataset Splits | No | Training stops when the validation cost stops decreasing. ... The size of each model is chosen from a limited set {100, 200, 400, 600, 800} to minimize the validation error for each polyphonic music task. |
| Hardware Specification | No | We would like to thank NSERC, Compute Canada, and Calcul Québec for providing computational resources. |
| Software Dependencies | No | We would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). |
| Experiment Setup | Yes | We use stochastic gradient descent (SGD) and employ the strategy of clipping the gradient proposed by Pascanu et al. (2013a). Training stops when the validation cost stops decreasing. The cutoff threshold for the gradients is set to 1. The hyperparameter for the learning rate schedule is tuned manually for each dataset. We set the hyperparameter β to 2330 for Nottingham, 1475 for Muse Data and 100 for JSB Chorales. The weights of the connections between any pair of hidden layers are sparse, having only 20 nonzero incoming connections per unit... Each weight matrix is rescaled to have a unit largest singular value... The weights of the connections between the input layer and the hidden state as well as between the hidden state and the output layer are initialized randomly from the white Gaussian distribution with its standard deviation fixed to 0.1 and 0.01, respectively. To regularize the models, we add white Gaussian noise of standard deviation 0.075 to each weight parameter every time the gradient is computed (Graves, 2011). (A minimal sketch of these settings follows the table.) |
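
The Experiment Setup row quotes several implementation details: sparse hidden-to-hidden initialization with 20 incoming connections per unit, rescaling each recurrent weight matrix to a unit largest singular value, Gaussian input/output initialization with standard deviations 0.1 and 0.01, gradient clipping at a threshold of 1, and Gaussian weight noise of standard deviation 0.075 added whenever a gradient is computed. The NumPy sketch below is not the authors' Theano implementation; the layer sizes, learning rate, and placeholder gradient function are illustrative assumptions, while the numeric settings follow the quoted setup.

```python
# Minimal sketch (assumed, not the authors' code) of the quoted
# initialization and SGD tricks, using NumPy.
import numpy as np

rng = np.random.default_rng(0)


def sparse_recurrent_init(n_units, n_incoming=20):
    """Hidden-to-hidden matrix with `n_incoming` nonzero incoming weights
    per unit, rescaled to have a unit largest singular value."""
    W = np.zeros((n_units, n_units))
    for j in range(n_units):  # column j holds the incoming weights of unit j
        idx = rng.choice(n_units, size=n_incoming, replace=False)
        W[idx, j] = rng.normal(0.0, 1.0, size=n_incoming)
    W /= np.linalg.svd(W, compute_uv=False)[0]  # unit spectral norm
    return W


def clip_gradient(grads, threshold=1.0):
    """Rescale the whole gradient when its norm exceeds `threshold`
    (gradient clipping of Pascanu et al., 2013a)."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads


def sgd_step(params, grad_fn, lr=0.01, noise_std=0.075):
    """One SGD update: add white Gaussian weight noise before computing
    the gradient (Graves, 2011), then clip it and update the clean weights."""
    noisy = [p + rng.normal(0.0, noise_std, size=p.shape) for p in params]
    grads = clip_gradient(grad_fn(noisy), threshold=1.0)
    return [p - lr * g for p, g in zip(params, grads)]


# Example parameter set for one hidden layer of 200 units (sizes assumed;
# 88 is e.g. the piano-roll dimensionality of the music datasets).
n_in, n_hid, n_out = 88, 200, 88
params = [
    rng.normal(0.0, 0.1, size=(n_in, n_hid)),    # input -> hidden, std 0.1
    sparse_recurrent_init(n_hid),                # hidden -> hidden, sparse
    rng.normal(0.0, 0.01, size=(n_hid, n_out)),  # hidden -> output, std 0.01
]


def dummy_grad(ps):
    """Placeholder gradient; a real run would backpropagate the sequence
    log-likelihood through time."""
    return [np.ones_like(p) for p in ps]


params = sgd_step(params, dummy_grad)
```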