Architectural Complexity Measures of Recurrent Neural Networks

Authors: Saizheng Zhang, Yuhuai Wu, Tong Che, Zhouhan Lin, Roland Memisevic, Russ R. Salakhutdinov, Yoshua Bengio

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that RNNs might benefit from larger recurrent depth and feedforward depth. We further demonstrate that increasing the recurrent skip coefficient offers performance boosts on long-term dependency problems. We empirically evaluate models with different recurrent/feedforward depths and recurrent skip coefficients on various sequential modelling tasks. We also show that our experimental results further validate the usefulness of the proposed definitions. (A sketch of a recurrent skip connection follows the table.)
Researcher Affiliation | Academia | MILA, Université de Montréal; University of Toronto; Carnegie Mellon University; Institut des Hautes Études Scientifiques, France; CIFAR
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not include an unambiguous statement or a direct link to the source code for the methodology described. It acknowledges Theano and Keras but does not provide its own code.
Open Datasets | Yes | Penn Treebank dataset: We evaluate our models on character level language modelling using the Penn Treebank dataset [22]. text8 dataset: Another dataset used for character level language modelling is the text8 dataset, which contains 100M characters from Wikipedia with an alphabet size of 27. Adding problem: the adding problem (and the copying memory problem that follows it) was introduced in [10]. Copying memory problem: Each input sequence has length of T + 20... Sequential MNIST dataset: Each MNIST image is reshaped into a 784 × 1 sequence, turning the digit classification task into a sequence classification one with long-term dependencies [25, 24]. (A data-generation sketch for the adding problem follows the table.)
Dataset Splits | Yes | Penn Treebank dataset: It contains 5059k characters for training, 396k for validation and 446k for test, and has an alphabet size of 50.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., specific GPU or CPU models).
Software Dependencies | No | The paper mentions using Adam [26] for optimization, and acknowledges Theano [28] and Keras [29], but does not provide specific version numbers for these or other software components crucial for replication.
Experiment Setup | Yes | For all of our experiments we use Adam [26] for optimization, and conduct a grid search on the learning rate in {10^-2, 10^-3, 10^-4, 10^-5}. For tanh RNNs, the parameters are initialized with samples from a uniform distribution. For LSTM networks we adopt a similar initialization scheme, while the forget gate biases are chosen by a grid search over {-5, -3, -1, 0, 1, 3, 5}. We employ early stopping and the batch size was set to 50. (A grid-search sketch follows the table.)
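
The recurrent skip coefficient discussed under Research Type concerns how far back in time a hidden state can directly reach in the unfolded graph. Below is a minimal NumPy sketch of a tanh RNN whose state also receives a connection from s steps earlier; the function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def skip_rnn_forward(x_seq, W_x, W_h, W_skip, b, s=3):
    """Tanh RNN whose state at step t also sees the state from step t - s.

    x_seq : (T, input_dim) input sequence
    W_x   : (hidden_dim, input_dim) input-to-hidden weights
    W_h   : (hidden_dim, hidden_dim) hidden-to-hidden weights at lag 1
    W_skip: (hidden_dim, hidden_dim) hidden-to-hidden weights at lag s
    b     : (hidden_dim,) bias
    s     : length of the recurrent skip connection (illustrative choice)
    """
    T = x_seq.shape[0]
    hidden_dim = b.shape[0]
    h = np.zeros((T + 1, hidden_dim))  # h[0] is the zero initial state
    for t in range(1, T + 1):
        h_skip = h[t - s] if t - s >= 0 else np.zeros(hidden_dim)
        h[t] = np.tanh(W_x @ x_seq[t - 1] + W_h @ h[t - 1] + W_skip @ h_skip + b)
    return h[1:]
```

The intuition behind the measure: with connections at lag s, information can cross a span of T time steps in roughly T/s recurrent transitions instead of T, which is why larger skip coefficients help on long-term dependency tasks.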
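
The adding problem quoted under Open Datasets is the benchmark introduced in [10]; the paper itself gives no generator, so the sketch below assumes the usual two-channel formulation (a channel of random values plus a 0/1 marker channel, with the sum of the two marked values as the target). Some variants additionally restrict where the markers may fall.

```python
import numpy as np

def make_adding_example(T, rng=np.random):
    """One example of the two-channel adding problem (standard formulation).

    Channel 0: T values drawn uniformly from [0, 1].
    Channel 1: all zeros except two marker positions set to 1.
    Target   : sum of the two marked values from channel 0.
    """
    values = rng.uniform(0.0, 1.0, size=T)
    markers = np.zeros(T)
    i, j = rng.choice(T, size=2, replace=False)
    markers[i] = markers[j] = 1.0
    x = np.stack([values, markers], axis=1)  # shape (T, 2)
    y = values[i] + values[j]
    return x, y
```

Sequential MNIST, by contrast, needs no generator: as the quote states, each 28 × 28 image is simply flattened into a 784 × 1 sequence, e.g. `x.reshape(784, 1)`.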
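
The Experiment Setup row lists a search space but no released training script. A hypothetical scaffold for that grid search is sketched below; `train_and_evaluate` is a placeholder standing in for a full training run with Adam and early stopping, and its name, signature, and the lower-is-better assumption are ours, not the authors'.

```python
import itertools

LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5]      # learning-rate grid from the paper
FORGET_GATE_BIASES = [-5, -3, -1, 0, 1, 3, 5]  # searched for LSTM models only
BATCH_SIZE = 50

def grid_search(train_and_evaluate, use_lstm):
    """Return the best validation score and its hyperparameters.

    train_and_evaluate(lr, forget_bias, batch_size) -> validation score
    is a placeholder for one training run with Adam and early stopping.
    """
    biases = FORGET_GATE_BIASES if use_lstm else [None]
    best = None
    for lr, fb in itertools.product(LEARNING_RATES, biases):
        score = train_and_evaluate(lr=lr, forget_bias=fb, batch_size=BATCH_SIZE)
        if best is None or score < best[0]:  # assumes lower is better (e.g. BPC or error)
            best = (score, {"lr": lr, "forget_bias": fb})
    return best
```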