Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

Authors: Behnam Neyshabur, Yuhuai Wu, Russ R. Salakhutdinov, Nati Srebro

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes. [...] We compare the performance of SGD vs. Path-SGD with/without κ(2). [...] We evaluate Path-SGD on the Sequential MNIST problem. Table 2, right column, reports test error rates achieved by RNN-Path compared to the previously published results. (A conceptual sketch of the path-SGD scaling appears after the table.)
Researcher Affiliation | Academia | Behnam Neyshabur (Toyota Technological Institute at Chicago, bneyshabur@ttic.edu); Yuhuai Wu (University of Toronto, ywu@cs.toronto.edu); Ruslan Salakhutdinov (Carnegie Mellon University, rsalakhu@cs.cmu.edu); Nathan Srebro (Toyota Technological Institute at Chicago, nati@ttic.edu)
Pseudocode | No | The paper provides mathematical derivations and descriptions of algorithms, but it does not include a dedicated section or figure labeled "Pseudocode" or "Algorithm".
Open Source Code | No | The paper thanks Saizheng Zhang for sharing a base code for RNNs, but it does not state that the code developed for this paper is open source or provide a link.
Open Datasets | Yes | We train a single-layer RNN with H = 200 hidden units for the task of word-level language modeling on Penn Treebank (PTB) Corpus [13]. [...] For both tasks, we closely follow the experimental protocol in [12]. We train a single-layer RNN consisting of 100 hidden units with path-SGD, referred to as RNN-Path. [...] Next, we evaluate Path-SGD on the Sequential MNIST problem.
Dataset Splits | Yes | We use the standard split (929k training, 73k validation and 82k test) and the vocabulary size of 10k words. (A data-loading sketch for this split follows the table.)
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the "Adam optimizer [8]" and "most deep learning libraries", but it does not specify software versions for these or other dependencies.
Experiment Setup | Yes | We initialize the weights by sampling from the uniform distribution with range [-0.1, 0.1]. The plots compare the training and test errors using a mini-batch of size 32 and backpropagating through T = 20 time steps, where the step-size is chosen by a grid search. [...] We performed grid search for the learning rates over {10^-2, 10^-3, 10^-4} for both our model and the baseline. Non-recurrent weights were initialized from the uniform distribution with range [-0.01, 0.01]. [...] For ReLU RNNs, we initialize the recurrent matrices from uniform[-0.01, 0.01], and uniform[-0.2, 0.2] for non-recurrent weights. For LSTMs, we use orthogonal initialization [21] for the recurrent matrices and uniform[-0.01, 0.01] for non-recurrent weights. (A configuration sketch follows the table.)
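
The Research Type row above summarizes the paper's comparison of SGD against path-SGD for ReLU RNNs. For orientation only, here is a minimal NumPy sketch of the general path-SGD scaling from the earlier feedforward formulation (one forward and one backward pass through the squared-weight network). It is not the RNN-specific update with the κ(2) term studied in this paper, and the function names and bias-free layer layout are assumptions of the sketch.

```python
import numpy as np

def path_sgd_scaling(weights):
    """Per-weight path-SGD scaling for a bias-free feedforward ReLU net.

    weights[k] has shape (n_{k+1}, n_k), i.e. h_{k+1} = relu(weights[k] @ h_k).
    Returns kappa with kappa[k][j, i] = sum over paths through edge i -> j of
    the product of squared weights of the other edges on that path, computed
    with one forward and one backward pass through the squared-weight network.
    """
    L = len(weights)
    # gamma_in[k]: per-unit sum of squared-weight path products from the inputs to layer k.
    gamma_in = [np.ones(weights[0].shape[1])]
    for W in weights:
        gamma_in.append((W ** 2) @ gamma_in[-1])
    # gamma_out[k]: per-unit sum of squared-weight path products from layer k to the outputs.
    gamma_out = [None] * (L + 1)
    gamma_out[L] = np.ones(weights[-1].shape[0])
    for k in range(L - 1, -1, -1):
        gamma_out[k] = (weights[k] ** 2).T @ gamma_out[k + 1]
    # Edge (i -> j) in weights[k] connects layer k (input side) to layer k + 1 (output side).
    return [np.outer(gamma_out[k + 1], gamma_in[k]) for k in range(L)]

def path_sgd_step(weights, grads, lr):
    """One path-SGD update: divide each gradient by its per-weight path scaling."""
    kappa = path_sgd_scaling(weights)
    return [W - lr * g / k for W, g, k in zip(weights, grads, kappa)]
```

The RNN case in the paper additionally has to handle recurrent weights shared across time steps, which this feedforward sketch does not attempt.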
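
The Open Datasets and Dataset Splits rows point to word-level Penn Treebank modeling with the standard 929k/73k/82k split and a 10k-word vocabulary. The sketch below shows one way to reproduce that preprocessing; the file names (ptb.train.txt, ptb.valid.txt, ptb.test.txt) and the <unk>/<eos> conventions follow the commonly distributed Mikolov version of PTB and are assumptions, not details stated in the paper.

```python
from collections import Counter

def load_ptb_splits(train_path="ptb.train.txt",
                    valid_path="ptb.valid.txt",
                    test_path="ptb.test.txt",
                    vocab_size=10000):
    """Tokenize the PTB splits at the word level and map words to ids.

    With the standard preprocessing this yields roughly 929k / 73k / 82k
    training / validation / test tokens and a 10k-word vocabulary, matching
    the split reported in the paper.
    """
    def read_words(path):
        with open(path) as f:
            # Treat each end of line as an explicit <eos> token (common convention).
            return f.read().replace("\n", " <eos> ").split()

    train_words = read_words(train_path)
    counts = Counter(train_words)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    word_to_id = {w: i for i, w in enumerate(vocab)}
    unk = word_to_id.get("<unk>", 0)

    def to_ids(words):
        return [word_to_id.get(w, unk) for w in words]

    return (to_ids(train_words),
            to_ids(read_words(valid_path)),
            to_ids(read_words(test_path)),
            word_to_id)
```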
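
The Experiment Setup row quotes the PTB configuration: a single-layer RNN with H = 200 hidden units, weights drawn from uniform[-0.1, 0.1], mini-batches of 32, truncated backpropagation through T = 20 steps, and a learning-rate grid of {1e-2, 1e-3, 1e-4}. Below is a hedged PyTorch sketch of that configuration; the embedding size, the plain-SGD baseline optimizer object, and anything else not quoted above are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class WordRNN(nn.Module):
    """Single-layer word-level ReLU RNN sized as in the PTB experiment."""

    def __init__(self, vocab_size=10000, embed_dim=200, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, nonlinearity="relu",
                          batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)
        # Initialization reported in the paper: uniform in [-0.1, 0.1].
        for p in self.parameters():
            nn.init.uniform_(p, -0.1, 0.1)

    def forward(self, tokens, hidden=None):
        out, hidden = self.rnn(self.embed(tokens), hidden)
        return self.decoder(out), hidden

# Settings quoted in the paper: mini-batch 32, truncated BPTT over T = 20
# steps, learning rate chosen by grid search over {1e-2, 1e-3, 1e-4}.
BATCH_SIZE, BPTT_STEPS = 32, 20
LEARNING_RATE_GRID = [1e-2, 1e-3, 1e-4]

def make_baseline_optimizer(model, lr):
    # The SGD baseline; a path-SGD run would instead rescale each gradient by
    # the per-weight path factor sketched earlier (assumption of this sketch).
    return torch.optim.SGD(model.parameters(), lr=lr)
```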