How to Construct Deep Recurrent Neural Networks
Authors: Razvan Pascanu; Caglar Gulcehre; Kyunghyun Cho; Yoshua Bengio
ICLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed deep RNNs are empirically evaluated on the tasks of polyphonic music prediction and language modeling. The experimental result supports our claim that the proposed deep RNNs benefit from the depth and outperform the conventional, shallow RNNs. |
| Researcher Affiliation | Academia | 1 Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, {pascanur, gulcehrc}@iro.umontreal.ca, yoshua.bengio@umontreal.ca 2 Department of Information and Computer Science, Aalto University School of Science, kyunghyun.cho@aalto.fi |
| Pseudocode | No | The paper provides mathematical equations and architectural diagrams but no pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link for open-source code. |
| Open Datasets | Yes | We test the RNNs on the task of polyphonic music prediction using three datasets which are Nottingham, JSB Chorales and Muse Data (Boulanger-Lewandowski et al., 2012). On the task of character-level and word-level language modeling, we use Penn Treebank Corpus (Marcus et al., 1993). |
| Dataset Splits | No | Training stops when the validation cost stops decreasing. ... The size of each model is chosen from a limited set {100, 200, 400, 600, 800} to minimize the validation error for each polyphonic music task. |
| Hardware Specification | No | We would like to thank NSERC, Compute Canada, and Calcul Québec for providing computational resources. |
| Software Dependencies | No | We would like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012). |
| Experiment Setup | Yes | We use stochastic gradient descent (SGD) and employ the strategy of clipping the gradient proposed by Pascanu et al. (2013a). Training stops when the validation cost stops decreasing. The cutoff threshold for the gradients is set to 1. The hyperparameter for the learning rate schedule is tuned manually for each dataset. We set the hyperparameter β to 2330 for Nottingham, 1475 for Muse Data and 100 for JSB Chorales. The weights of the connections between any pair of hidden layers are sparse, having only 20 nonzero incoming connections per unit... Each weight matrix is rescaled to have a unit largest singular value... The weights of the connections between the input layer and the hidden state as well as between the hidden state and the output layer are initialized randomly from the white Gaussian distribution with its standard deviation fixed to 0.1 and 0.01, respectively. To regularize the models, we add white Gaussian noise of standard deviation 0.075 to each weight parameter every time the gradient is computed (Graves, 2011). (A minimal sketch of these settings follows the table.) |
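
The Experiment Setup row quotes several implementation details: sparse hidden-to-hidden initialization with 20 incoming connections per unit, rescaling each recurrent weight matrix to a unit largest singular value, Gaussian input/output initialization with standard deviations 0.1 and 0.01, gradient clipping at a threshold of 1, and Gaussian weight noise of standard deviation 0.075 added whenever a gradient is computed. The NumPy sketch below is not the authors' Theano implementation; the layer sizes, learning rate, and placeholder gradient function are illustrative assumptions, while the numeric settings follow the quoted setup.

```python
# Minimal sketch (assumed, not the authors' code) of the quoted
# initialization and SGD tricks, using NumPy.
import numpy as np

rng = np.random.default_rng(0)


def sparse_recurrent_init(n_units, n_incoming=20):
    """Hidden-to-hidden matrix with `n_incoming` nonzero incoming weights
    per unit, rescaled to have a unit largest singular value."""
    W = np.zeros((n_units, n_units))
    for j in range(n_units):  # column j holds the incoming weights of unit j
        idx = rng.choice(n_units, size=n_incoming, replace=False)
        W[idx, j] = rng.normal(0.0, 1.0, size=n_incoming)
    W /= np.linalg.svd(W, compute_uv=False)[0]  # unit spectral norm
    return W


def clip_gradient(grads, threshold=1.0):
    """Rescale the whole gradient when its norm exceeds `threshold`
    (gradient clipping of Pascanu et al., 2013a)."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads


def sgd_step(params, grad_fn, lr=0.01, noise_std=0.075):
    """One SGD update: add white Gaussian weight noise before computing
    the gradient (Graves, 2011), then clip it and update the clean weights."""
    noisy = [p + rng.normal(0.0, noise_std, size=p.shape) for p in params]
    grads = clip_gradient(grad_fn(noisy), threshold=1.0)
    return [p - lr * g for p, g in zip(params, grads)]


# Example parameter set for one hidden layer of 200 units (sizes assumed;
# 88 is e.g. the piano-roll dimensionality of the music datasets).
n_in, n_hid, n_out = 88, 200, 88
params = [
    rng.normal(0.0, 0.1, size=(n_in, n_hid)),    # input -> hidden, std 0.1
    sparse_recurrent_init(n_hid),                # hidden -> hidden, sparse
    rng.normal(0.0, 0.01, size=(n_hid, n_out)),  # hidden -> output, std 0.01
]


def dummy_grad(ps):
    """Placeholder gradient; a real run would backpropagate the sequence
    log-likelihood through time."""
    return [np.ones_like(p) for p in ps]


params = sgd_step(params, dummy_grad)
```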